IS THE RATING OF HUMAN CHARACTER
PRACTICABLE? 


(Continued from November) 


HAROLD RUGG 
The Lincoln School of Teachers College, 


The conclusions from the study of official ratings and of the data 
collected at the Officers’ Personnel School,¹ confirmed the conviction
that another more elaborate and objectified experiment should be 
made in army camps. That was done in November 1918 at Camps 
Sheridan and Taylor. The experiment at Sheridan was organized 
largely for the purpose of working out the final procedure to be followed 
at Taylor. The data for both experiments are used. 

One hundred and fifty-one officers cooperated in the study. They 
met in two groups, one of 93, the other of 58. Practically all of these 
officers were college trained men. Their average score on the Army 
Psychological Test was B+, indicating that we had a distinctly intelligent
group of cooperating officers. I personally believe that the men
engaged in the study represented as high a level of intelligence as will 
be found in typical groups of supervisory school officers. 

How the “Experimental” Ratings Were Made.—Each officer constructed
a rating scale in accordance with the following directions:

First.—A list of officers was compiled which contained 25 or more
names, made up largely from officers who were present in the conference
room. The greatest care was taken to see that the names of
officers used on the scale were only those whom the rater felt qualified
to rate. It was felt then, and proven later, that it was most important
that each person used on a scale should be known intimately by the
rating officer.


Second.—These names were arranged as nearly as possible in exact
rank order of merit. This was done independently for each of the five
groups of qualities on the scale. It is of the greatest importance that
the scale be made separately for each group of qualities. From each
of these lists five scale-men were selected. This was done by the
very careful scrutiny and review of the two or three men at the head
of the list on a given quality. From this the “15” men were selected.
Similarly, the “3” men at the foot of the list were selected from the
two or three poorest; likewise, with the others, the “12,” “9” and
“6” men.

1 Presented in the November issue of this journal.

Third.—There was a thorough discussion, taken part in by all 
present, of the construction of the scale and, between the first and 
second conferences each rating officer studied and revised his scale. 

It is my judgment that the Camp Taylor scales represent excellence 
in scale-making that it will be difficult, if not impossible, to duplicate 
under working conditions in public schools. It is important to 
remember this in applying the conclusions of the army investigation 
of rating in school practice.

Fourth.—Each of the 151 officers gave me a copy of his scale in 
order that a detailed comparison might be made of the extent to 
which officers agreed who had used the same scale-men. Remember, 
that the officers had constructed their scale, so far as possible, from 
their associates in the conference room. 

Fifth.—At the second half-day conference each officer rated on his
new scale each other officer in the room whom he was positive that 
he knew well enough to rate. No officer was allowed to rate another 
unless he was sure that he was thoroughly qualified to do so. To 
make this effective and to supply statistical information, each officer 
stated the number of weeks that he had associated in the army with 
the officer rated and estimated the extent to which he was qualified 
to rate him. In addition, each officer submitted a list of the other 
officers present, who, in his judgment, were competent to rate him. 
These officers are referred to in my subsequent discussions as “competent
raters.”

Sixth.—A third half-day conference was held at which this entire
procedure of revising scales and ratings was repeated. Thus, two 
conference ratings are available on each officer which might be properly 
regarded as made by officers who were thoroughly competent to rate. 

Seventh.—Each officer supplied detailed personal data concerning
his age, schooling, college honors, annual earnings, occupational
activities, etc.




Eighth.—Each of the officers took the Army Psychological Test
once, a number of them twice, and all of them took the Thorndike 
Alertness Test twice. 

Ninth.—A round-table discussion was held in which the officers 
stated the difficulties which they had encountered in constructing 
and using scales and in which they made recommendations for changes 
and improvements in the procedure. 


HOW CLOSELY DO RATINGS AGREE WHEN MADE BY DIFFERENT
RATERS ON THE SAME PERSON?


From this elaborate experiment what does one learn about the
reliability and practicability of rating human character? He learns
that even under such carefully controlled conditions, it is practically
impossible to secure ratings on point scales which are reliable estimates
of character. He learns first, for example, that when a person is rated
independently by from three to thirteen competent raters, the
range in the ratings will commonly be as large as 30 points on a total
scale of 80 points. In but one case in a list of 16 officers was the
extreme difference less than 20 points. In the case of 8 of the 16
the range exceeded 30 points. Table II summarizes the data.


TABLE II.—AVERAGE OF THE RANGE AND AVERAGE OF THE AVERAGE DEVIATION
OF 3 TO 13 INDEPENDENT RATINGS ON A PERSON

                                  Average of the range          Average of the average deviation
                               General       Raters reported    General       Raters reported
                               conference    as competent       conference    as competent
                               group         to rate            group         to rate
                               Very   Less   Very   Less        Very   Less   Very   Less
                               well   well   well   well        well   well   well   well
                               quali- quali- quali- quali-      quali- quali- quali- quali-
                               fied   fied   fied   fied        fied   fied   fied   fied

A. Includes all persons rated
   by 3 or more raters.......  16.9   17.6   15.6   21.5         6.3    7.7    7.0    8.4

B. Includes only persons who
   were rated by 3 or more
   raters in both groups.....  18.7   19.9   10.7   21.7         6.7    8.9    4.5    4.2

No. of 18 | 18 3 | 3 19 2 4
One learns furthermore that the typical variations in such independent
ratings (as shown by the average deviation) are commonly
in the neighborhood of 7 points (for the data of the first conference it
was 7 points and at the second it was 6). The probability is too
uncertain, therefore, that a single rating on a person will locate that
person even within his proper “fifth” of the rating scale. The chances
can not be more than 4 to 1 that any rating will be within 14 points of the
person’s true rating. These results, therefore, strikingly confirm the
results obtained from the Fort Sheridan ratings and from those secured
in the first experiment at Camp Sheridan.
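
The two measures of disagreement used above, the range and the average deviation, can be sketched in a few lines of Python. This is only an illustration: the ratings are hypothetical, the function names are invented here, and the average deviation is assumed to be taken about the mean of the ratings (the text does not say whether the mean or the median was used).

```python
from statistics import mean


def rating_range(ratings):
    """Spread between the highest and lowest rating given to one person."""
    return max(ratings) - min(ratings)


def average_deviation(ratings):
    """Mean absolute deviation of the ratings about their own mean.

    Taking the mean as the center is an assumption of this sketch.
    """
    center = mean(ratings)
    return mean(abs(r - center) for r in ratings)


# Hypothetical ratings of one officer by six raters (not data from the study).
ratings = [52, 60, 61, 68, 74, 82]

spread = rating_range(ratings)
avg_dev = average_deviation(ratings)
print(f"range = {spread} points, average deviation = {avg_dev:.1f} points")
```

With figures of this size a single rating can easily fall outside the person's proper “fifth” of the scale, which is the difficulty the paragraph above describes.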

With such conclusions before us two questions press insistently for
answer. First: Why do independent ratings made so carefully
disagree so much? Second: Is the process of judging human character
... disagreement of ratings is not an adequate criterion for
measuring the reliability of a scale? If two raters assign the same
score to a person do these scores represent equivalent and comparable
judgments of character? We have the data at hand upon which to
base a rather careful discussion of these questions.


WHY DO INDEPENDENT RATINGS VARY SO GREATLY?


Four possible causes of disagreements were explored: 

1. Lack of acquaintance. 

2. A tendency to rate high or low. 

3. The extent to which the analysis of “military abilities” (which
is involved in the construction and use of rating scales) rests upon 
mental backgrounds that are so distinctly different as to result in 
totally different placements of the same officer on different scales. 
It was recognized that such an analysis may even result in similar 
placements, and quite different evaluations of ability. Thus, two 
independent ratings on an officer, as well as two different scales, 
may represent a very different distribution and differentiation of 
ability. 

4. Is the process of discriminating clearly the elements of human
character contributed to by so many complicating factors as to render
practically impossible close agreements in total estimates of the man?
These complicating factors may include: (a) the ability to state
objectively outstanding personal qualities; (b) the effect of peculiarities
of special groups of qualities on the estimate of other qualities;
(c) the emphasis given to one group of qualities by one officer and to a
different group by another; (d) the complexity of the task of holding
in mind the various special traits which contribute to a group of
qualities without elaborate devices for discriminating the subordinate
elements.

I shall treat the third and fourth factors in great detail. The discussion,
however, of the “effect of extent of acquaintance” and of a
“tendency to rate high or low” must be inadequate. Data for the
former are incomplete because of our care to permit only competent
raters to rate. However, there were differences observable in “extent
of acquaintance” and I report the results as suggestive of more sweeping
conclusions that I believe would obtain from more complete
data.

1. The Effect of Acquaintance.—Comparison was made of the
agreements of ratings for two groups of raters representing different
degrees of acquaintance. The difference between the average deviations
of the two “acquaintance” groups is rather marked. The first
set gave 8.9 for one group and 6.7 for the other; the second gave 7.7
against 6.3. Furthermore, the average deviation for the “competent
raters” was smaller than for any other group. This we found to be
4 points. The data that we studied during the investigation were
impressive, pointing toward the conclusion that estimates of human
character depend closely upon intimacy of acquaintance and that it is
important to evaluate the competency of the rater. Another investigation
has been reported to me in which the reliability of letters of
recommendation was studied. The accuracy of the recommendation
was correlated with certain objective methods, and the correlation
ranged with the intelligence of the “judge” from 0.7 to 0. Evidence
like this points to the importance of both intimacy of acquaintance
and competency of judgment.

2. The Tendency to Rate High or Low.—The dearth of ratings by
any single officer caused great difficulty in treating this factor thoroughly.
If the average rating had been based upon a reasonably
large group, say 25 or more, a correction could have been applied by
the following procedure: (a) the determination of the variability
(by σ or by Q) of each rater’s ratings; (b) the average of the ratings
falls at 60; the estimated standard deviation is between 13 and 15,
and the estimated Q is 10; (c) the ratings of each rater thus could be
approximated by equating the position of each of his actual ratings
on the scale to the position that it would have occupied on the scale
of the “true” distribution. The average would be made 60 and each
other rating would be increased or decreased by the relative difference
between it and the average.
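
One way to read the correction described above is as a linear rescaling of each rater's ratings onto a common distribution. The sketch below assumes that reading. The target mean of 60 comes from the text; the target standard deviation of 14 is simply the midpoint of the stated 13 to 15; the function name, the use of the population standard deviation, and the sample ratings are all assumptions made for illustration.

```python
from statistics import mean, pstdev

TARGET_MEAN = 60.0  # average of the "true" distribution, from the text
TARGET_SD = 14.0    # assumed midpoint of the stated range 13 to 15


def correct_ratings(ratings, target_mean=TARGET_MEAN, target_sd=TARGET_SD):
    """Rescale one rater's ratings so their mean and spread match the target.

    Each rating keeps its relative position within the rater's own
    distribution but is expressed on the scale of the target distribution.
    """
    rater_mean = mean(ratings)
    rater_sd = pstdev(ratings)
    if rater_sd == 0:
        # A rater who gives everyone the same rating carries no spread to rescale.
        return [target_mean for _ in ratings]
    return [target_mean + (r - rater_mean) / rater_sd * target_sd for r in ratings]


# Hypothetical ratings from a rater who tends to rate high and bunch his ratings.
lenient_rater = [70, 75, 80, 85, 90]
corrected = correct_ratings(lenient_rater)
print([round(c, 1) for c in corrected])
```

As the text notes, a correction of this kind is only meaningful when each rater has rated a reasonably large group, say 25 or more.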

To investigate this factor thoroughly the ratings of persons should 
be available, each person rating a considerable number of individuals. 
Only four officers in my study rated as many as 10 to 14 persons, and 
only one rated 15. From the experience gained in carrying through 
this inquiry it is doubted if, either in the army or in school practice, 
an experiment can be set up in which a larger number of valid ratings 
can be obtained on one person under natural rating conditions. 
For these reasons the analysis of the tendency to rate high or low 
included merely a careful approximate correction of ratings made by 
extremely high and extremely low raters. 

The results were all plotted and submitted in the original report. 
Lack of space prohibits their publication here. I shall ask the reader 
to accept my conclusion from the careful study of these meager data 
that divergencies in judgments of character are but slightly, if at all, 
accounted for by a tendency to rate high or to rate low. Hence, the 
great importance of studying carefully the third and fourth factors 
referred to previously. 

3. Incomparable Scales as Contributing Causes of Differences in 
Rating.—Of far greater importance is the question: Are the scales 
comparable on which ratings are made? Do the mental backgrounds 
against which estimates of character are made afford similar standards 
of judgment? Several basic questions must be answered:

(a) When an officer is used by two or more raters, is he assigned
the same scale value on the different scales? Is he a “fifteen” man on
all scales, a man, a “six” man, etc.?

(b) Does the placement of scale-men on the “intelligence” part
of the rating scale correspond closely to placement in intelligence by
the Army Psychological Tests?

(c) Are differences in two ratings on an officer by different raters
paralleled by corresponding differences in the positions of “scale-men”
on the scales used in the rating?

(d) Is the same officer used as a “scale-man” on more than one
group of qualities?


ANALYSIS OF THE “SCALE-MEN” ON 45 RATING SCALES


To answer these questions I made a minute analysis of 45 rating
scales. The first step was to determine the extent to which a given
scale-man was used at the same scale-value by different persons.
Fortunately a large number of cases are available. Four hundred and
fifty-eight instances were found on the scales in which the same men
were used on from 2 to 13 different scales. Each instance of recurrence
of a name on different scales was tabulated to give a semi-graphic
chart upon which the following facts were shown:

1. The specific scale-value at which the officer was used on each
scale.

2. The Psychological Test score and the Alertness Test score of
each officer making the scale.

3. The Psychological Test score of the officers used on the scale.

4. The total number of scales upon which the officer was used at
each scale value of each quality.

5. The per cent of the total number of times an officer was used
on a given quality on which the consensus of agreement was used.

6. The average position to which the officer was assigned on each
quality.

Conclusion.—It is impossible to print in this article the ten tables
or the detailed charts which were made up to study this question.
In Table III a summary is presented of the essential data. A minute
analysis of the original tables and charts shows that rating scales
made under as carefully controlled conditions as were those are
distinctly incomparable. Intervals upon different men’s scales do not
represent closely the same amount of the trait in question. Table III
(typical of three other tables) shows that in not more than 50 to 60 per
cent of the cases will an officer appear at the same scale value on different
scales. It shows further that a divergence of 4½ to 6 points (that is, 35
to 50 per cent of the entire scale!) will typically occur between different
placements of the same officer. It shows that the same man, used at
“15” on one scale, will be used at “12” on another, “9” on others,
“6” on occasional ones and even “3” on a few. Persons regarded
as the “best captain I ever knew” were selected for scale positions on
other scales as “the poorest captain I ever knew!” And the selection
was made by the most objectified procedure we have yet devised and
by a procedure that we probably cannot duplicate under practical
rating conditions in education.
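
The two summary figures reported per officer, the per cent of scales agreeing with the “consensus” and the average divergence from it, might be computed along the following lines. This is a sketch under assumptions: the consensus is taken to be the most frequently assigned scale value, the average divergence is taken over the divergent placements only, and the placements shown are hypothetical rather than drawn from the original charts.

```python
from collections import Counter


def consensus_agreement(placements):
    """Summarize how consistently one man is placed across several scales.

    Returns (consensus value, per cent of scales at the consensus,
    average divergence of the remaining placements from the consensus).
    """
    counts = Counter(placements)
    consensus, n_at_consensus = counts.most_common(1)[0]
    per_cent = 100 * n_at_consensus / len(placements)
    divergences = [abs(p - consensus) for p in placements if p != consensus]
    avg_divergence = sum(divergences) / len(divergences) if divergences else 0.0
    return consensus, per_cent, avg_divergence


# Hypothetical: one officer used as a scale-man on five different scales.
placements = [15, 15, 15, 12, 9]
summary = consensus_agreement(placements)
print(summary)
```

Because scale values are spaced 3 units apart on the four 15-point qualities, a divergence computed this way is a multiple or average of 3-unit steps, so even one step is a quarter of the 12-unit span from “3” to “15.”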

But, in using the “agreement” of scale placement as our criterion,
we are employing subjective methods. It happens that we have at
hand a very good comparison of subjective “ratings” with objective
measures of a particular group of traits, namely, “intelligence” as
measured by the Army Psychological Test and by the Thorndike
Alertness Test. The comparison of ratings and actual performance
scores of 59 officers is presented and summarized in Table IV. These


TABLE III.—PER CENT OF PERFECT AGREEMENT IN SCALE-POSITIONS, TOGETHER
WITH AVERAGE NUMBER OF UNITS OF DIVERGENCE FROM THE “CONSENSUS”
OF AGREEMENT
(All names used occurred on 5 or more scales)

Each row represents data on one officer

Physical qualities | Intelligence | Leadership | Personal qualities | General value
Per cent | Average deviation (given for each of the five groups of qualities)

78 6.0 ahs 50 7 57 8.0 
100 oa 50 4.0 100 57 | 14.4 
75 3.0 
71 12.0 
83 3.0 50 3.0 
80 3.0 50 3.0 50 5.0 50 5 
60 3.0 
50 | 6.0 ir 60 | 6.0 
60 | 4.5 50 4.0 
50 | 8.0 
100 
40 | 8.0 40 7.0 60 | 24.0 
80 | 8.0 
60 3.0 
40 | 3.0 
86 3.0 ke. 
| 80 | 6.0 67 | 8.0 
60 8.0 
20 4.5 
| 40 4.0 | 
| 60 | 3.0 | 
60 6.0 | | 
80 | 9.0 50 | 13.3 
60 | 4.5 | | | 
Average....| 72 | 4.4 | 56 51 | 58) 46) | 6 63 | 11.9


two sets of measures were made comparable by equating the range
and variability of the “ratings” to the range and variability of the
test scores. Thus, a “fifth” of the test “scale” is comparable to a
“fifth” of the rating “scale.” The detailed tables of the original
report gave the following facts: (1) the average position on the intelligence
part of the rating scale for men of given test score standings, say
“15,” “12,” “9,” “6,” and “3;” (2) the converse of these data.
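
The idea of expressing both measures in “fifths” can be illustrated as follows. The text does not spell out the exact procedure for equating range and variability, so this sketch makes two assumptions: rating-scale values of 3, 6, 9, 12 and 15 correspond to fifths 1 to 5, and test scores are cut into five equal-width bands across the observed range. The function names and the 0 to 100 score range in the example are hypothetical.

```python
def scale_value_to_fifth(value):
    """Map a rating-scale value (3, 6, 9, 12, 15) to its fifth (1 to 5)."""
    if value not in (3, 6, 9, 12, 15):
        raise ValueError(f"not a scale value: {value}")
    return value // 3


def score_to_fifth(score, lowest, highest):
    """Place a test score in one of five equal-width bands of the observed range.

    Equal-width bands are an assumption of this sketch, not the
    documented method of the original report.
    """
    if highest <= lowest:
        raise ValueError("highest must exceed lowest")
    width = (highest - lowest) / 5
    fifth = int((score - lowest) / width) + 1
    return min(max(fifth, 1), 5)  # the top score falls in the fifth band


# Hypothetical test score range of 0 to 100.
print(scale_value_to_fifth(12), score_to_fifth(55, 0, 100))
```

Once both standings are expressed in fifths, a man's test fifth can be set directly beside the fifth at which he was used on a rating scale.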
We can now answer the question: Is “intelligence” discriminated
accurately on the man-to-man comparison scale (assuming the validity
of our test measures of “intelligence”)?

It appears from Table IV that the men selected as “most intelligent”
(“15” men) were rather accurately picked. Of 13 men, 4 are rated
“15,” 8 are rated “12,” and one is rated “9.” For the other four
scale positions the discrimination is very inadequate. Men of “average”
intelligence (“9”) as shown by the tests are rated higher than
“superior” persons (“12”). The average test scores for the groups
denoted “12,” “9,” “6” and “3” on the rating scales are respectively
9.24, 10.29, 9.3 and 8.46.

We find at the “lowest” end of the scale great differences between
the rating on “intelligence” and the test score on intelligence. Out
of 28 “3” men on the rating scale only 5 appear as “3” men on the
psychological tests; 5 appear as “6” men; 11 appear as “9” men on
the test; 4 are placed at “12” on the test and three are “15” men on
the test! No evidence which has been secured in this investigation
makes more apparent the fact that the rating scale is thoroughly
relative. “The least intelligent man I ever knew” according to the
rating scale varies all the way from “the most intelligent” to “the
least intelligent” as shown by psychological tests. The fact that
psychological tests demand abilities which are not considered in the
rating scale is not forgotten in this connection. Admittedly it
should cause some lack of agreement between the test placing of a
man and the placement by the rating scale. The two measures,
however, clearly involve a very considerable number of common
abilities. It is these common abilities that should operate to give a
fairly close agreement between the two scales. The psychological
test at least classifies together, reasonably well, officers who are
nearly equal in ability. The rating scale should do this also. Hence,
whether we expect agreement in scale placement or not, we would
demand the same relative degree of homogeneity to come from the
use of either scale.
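
The counts given above for the 28 men used at “3” on the rating scale are enough to recompute their average test standing. The short check below uses only the figures quoted in the text; the variable names are invented for illustration.

```python
# Test standing (in scale units) of the 28 men used at "3" on the
# intelligence part of the rating scale, as counted in the text.
test_standing_counts = {3: 5, 6: 5, 9: 11, 12: 4, 15: 3}

n_men = sum(test_standing_counts.values())
mean_test_standing = sum(v * n for v, n in test_standing_counts.items()) / n_men
per_cent_exact = 100 * test_standing_counts[3] / n_men

print(n_men, round(mean_test_standing, 2), round(per_cent_exact, 1))
```

The mean of about 8.46 agrees with the average test score reported for the “3” group, and fewer than one in five of these men stand at “3” on the test.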

With fewer cases, and hence with not so much reliability, the Camp 
Sheridan material confirms the judgment as to the “relative” aspect 
of the rating scale. It is this relative aspect of the scale that makes 
rating so different from test scores. It is believed that such an analy- 
sis as we are making, supports our earlier hypothesis that the scales 
themselves do not represent equivalent amounts of the trait. In this 
case, also, we find that the chances are not large that any one assign- 
TABLE IV.—COMPARISON OF THE PSYCHOLOGICAL STANDING OF OFFICERS
AT CAMP TAYLOR WITH THEIR RATING SCALE-STANDINGS

(Standings in the two scales are expressed in “fifths of the respective scales.”
Numbers in the compartments of the table give the psychological test fifths)

Number of officer making the scale | Test of officer making scale | Rating scale fifths

(If test scores agree with ratings the numbers in these columns
will agree with the numbers at the head of columns)

1 2 3 4 5
“Lowest” | “Low” | “Middle” | “High” | “Highest”

| 5 5 

5 3 

4 | | 5 4 

2 

ae 4 3 

4 

3 | 3 1 4 

5 

| | 5 

| 4 | 

1 

2 || 

3 2 4 3 

3 | | 4 4 4 

3 | 1 

1 

3 5 

he 2 

a 3 4 2 

3 4 

4 2 

3 1 | 3 5 

| 2 

2 2 
3 
3 
4 
| 3 
1 5 2 4 
4 

1 | 2 4 

4 

al 4 

2 

2 2 5 

H | 


— 
— 

. &§ 

ical Alertness 
} officer 
scale | | 
| 
| | 
| 
10 3 
12 4 
14 | 
15 
16 3 
17 | 
18 | 
1} 20 | 5 

22 | 
23 | 
24 

4 

28 2 
ran 30 3 
33 
MEG 34 
ot 35 e 2 
36 
| 
39 | 
40 | 
41 

42 
43 | 
44 
48 
49 . 
51 e 
52 

56 | 
Average.. | 2.82 | 3.1 | 3.43 | 3.08 | 4.23
ment of an officer to a definite scale position will represent closely the
true scale placement of the man. 

In the course of our tabulations we studied the question: “Do
raters of superior intelligence discriminate intelligence as mental tests
do?” The answer is: “Slightly better than raters in general but yet
not at all well.” Tables were made comparing ratings and test
scores of the men used on scales by officers of superior intelligence.
Of 20 “5” men, 8 (40 per cent) were assigned to exactly the same
position on the rating scale as they obtained in the test. But the
average displacement for the whole group was 1.25 intervals, or 3.75
units. Thus, taken as a group, officers of outstandingly superior
intelligence discriminate intelligence in others little better than do
those of “average” intelligence. The evidence we are piling up points
constantly to the very great difficulty of estimating character. It
certainly calls for further and more minute analysis of the process of
judging human traits.
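
The two figures in the paragraph above follow from simple arithmetic, shown here as a brief check. The only assumption is that one interval on the rating scale equals 3 scale units (the spacing between “3,” “6,” “9,” “12” and “15”); the names are illustrative.

```python
UNITS_PER_INTERVAL = 3  # spacing between adjacent scale values (3, 6, ..., 15)

men_at_five = 20        # "5" men on the test, from the text
exact_matches = 8       # of these, rated at exactly the same position

per_cent_exact = 100 * exact_matches / men_at_five
displacement_units = 1.25 * UNITS_PER_INTERVAL  # 1.25 intervals, from the text

print(per_cent_exact, displacement_units)
```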

Two points are clear then: first, most careful construction of the 
man-to-man comparison rating scales does not lead to close agreement 
in placing scale-men; second, experimental “ratings” of ability show 
distinctly inadequate agreement with test measures of ability. 

If a number of persons disagree widely in rating another person, it 
may be due, wholly or in part, to differences in evaluating the abilities 
of persons used at particular positions on the rating scale. We are 
ready, therefore, to study in detail the third important matter. 


ARE DIFFERENCES IN RATINGS PARALLELED BY CORRESPONDING
DIFFERENCES IN THE PLACEMENT OF SCALE-MEN?


In the course of the discussion of this question we shall answer two 
related ones of the greatest importance: (a) if two ratings on a person 
agree closely, can it be inferred that each total rating is contributed to 
by approximately the same estimate of the component traits; (b) if 
scales can be found which do, throughout the range, represent nearly 
equivalent amounts of any trait, will ratings upon these scales closely 
approximate each other? 

These questions are of the highest psychological importance. Very
elaborate plates and tables were prepared to answer them. This
section of the original report, as submitted to the Army Committee on
Classification of Personnel, discussed the situation for a group of 10
officers. These 10 were selected for the comparison solely because they
were rated by eight or more different raters. They provide an important
opportunity to study correspondence in overlapping scales.
(I shall use “overlapping scales” to mean “scales which used one or
more common names.”) It is certain that these ten are representative
of the entire group.


TABLE V.—PER CENT OF 45 CASES IN WHICH THE SAME PERSON WAS USED ON 2
SCALES FOR WHICH HE WAS ASSIGNED THE SAME VALUE


29 60 
58 50 
36 21 
36 52 
37 37 


I can only print summary data, Tables V and VI. (The details are
available to any student of the problem.) Table V shows that in only
about one-third (37 per cent) of the cases in which an officer is used on
two scales will he be assigned the same value. Table VI presents a summary of the


TABLE VI.—COMPARISON OF THE DIFFERENCES IN TOTAL RATING OF TWO RATERS
ON ONE OFFICER WITH THE DIFFERENCES IN THE VALUES TO WHICH THEY
ASSIGNED COMMON SCALE-MEN

Left half:  Second rating showed increase of 0-5 points over first rating
Right half: Second rating showed increase of 6-10 points over first rating

Columns in each half:
  Number of instances in which the man was assigned to
    same value on 2 scales | different values on 2 scales
  Total numerical difference between the scale values on the two scales
  on which same men were used at different values
    first rating | second rating

Officer's number
1 20 7 68 70 6 £ 51 36 
9 12 | 9 91 88 10 9 129 94 
12 3 2 2 1 16 8 
13 106 (87 1 | 12 138 162 
17 1 13 191 | 231 4 1 | 6 15 
18 5 5 71 55 1 | 5 74 106 
21 3 2 9 | 12 6 5 57 45 
27 7 147 174 10 «(ol | 127 148 
29 4 17 | 176 197 =! 6 | 11 | 92 98 
Total...... 60 | 72 | 50 | 67
            132     | 117
            45.4% | 54.6% | 42.7% | 57.3%
TABLE VI (Continued).—COMPARISON OF THE DIFFERENCES IN TOTAL RATING OF
TWO RATERS ON ONE OFFICER WITH THE DIFFERENCES IN THE VALUES TO
WHICH THEY ASSIGNED COMMON SCALE-MEN

Left half:  Second rating showed increase of 11-20 points over first rating
Right half: Second rating showed increase of more than 20 points over first rating

Columns in each half:
  Number of instances in which the man was assigned to
    same value on 2 scales | different value on 2 scales
  Total numerical difference between the scale values on the two scales
  on which same men were used at different values
    first rating | second rating

Officer's number
1 x 9 187 143 
9 14 6 57 63 
12 5 3 12 33 1 2 15 27 
13 3 17 213 224 
17 8 14 156 156 
18 1 2 25 14 
21 =| 22 | 19 205 165 11 13 | 160 149 
27 7 12 113 174 4 10 101 128 
28 1 3 « 45 61 6 10 102 148 
29 1 2 9 18 2 2 18 27 
Total .... 70 | 97 | 24 | 38
           167     | 62
           41.8% | 58.2% | 38.7% | 61.3%
Total number of overlappings = 478


data on 478 instances of the use of the same person on two scales.
The detailed exhibit and Table XVIII lead clearly to the conclusion that
differences in total rating are not paralleled by even approximately
corresponding differences in scale-values.
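
The totals printed at the foot of Table VI can be used to recompute the share of overlapping instances in which a man held the same value on both scales. The sketch below uses only those printed totals; the dictionary keys and function name are illustrative. Recomputed shares may differ from the printed percentages by about a tenth of a point.

```python
# (same value, different values) totals for each band of difference in
# total rating, taken from the "Total" rows of Table VI.
table_vi_totals = {
    "0-5": (60, 72),
    "6-10": (50, 67),
    "11-20": (70, 97),
    "over 20": (24, 38),
}


def share_same_value(same, different):
    """Per cent of overlapping instances placed at the same scale value."""
    return 100 * same / (same + different)


shares = {band: share_same_value(*counts) for band, counts in table_vi_totals.items()}
total_overlappings = sum(same + diff for same, diff in table_vi_totals.values())

for band, share in shares.items():
    print(f"{band:>8} points: {share:.1f} per cent at the same value")
print("total overlappings:", total_overlappings)
```

The share at the same value falls only a few points between the smallest and the largest differences in total rating, which is the sense in which the two kinds of difference fail to run parallel.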

Differences in assignment of scale values are discernible only when
the differences in total rating become very large. This leads me to
suggest an explanation for differences in rating: that the whole operation
is contributed to by many small differences in estimating the
presence of subordinate groups of qualities; that many of these small
differences neutralize each other in contributing to the total score,
some being increases and some decreases; but that when the differences
in estimate become very large, it can be shown statistically that these
differences in placing men on the scale tend to parallel differences in
rating on the scale.

It is clear, therefore, that we need to explore in great detail the 
comparison of individuals with scale-men. In presenting the indi- 
vidual cases we shall answer the question: 
When two persons agree closely in the total rating to be assigned
another, can this agreement in numerical rating be interpreted to mean
an agreement in judgment of character? I believe this inference can be
drawn only when the total rating is contributed to by equivalent estimates
of component traits against equivalent scale-men. Specifically, if two
persons give another person the same rating, say 72 and 72, are the
ratings contributed to by like agreements in comparing the person rated
with the respective scale-men? Only by such coincidence in rating
subordinate traits against equivalent standards of judgment (the scale-men)
can we infer that agreement in total rating represents equivalence
in estimates of total character, and not a mere chance situation.

We need some specific cases to get the matter before us clearly. 
I shall discuss two distinctive conditions: the first, that in which two 
scales agree, that is, that the two scales consist of identical names 
at several values; the second, that in which the scales are apparently 
very different (the same person appearing on the two scales at different 
values), but in which ratings made against them are the same. The 
scales of 59 persons have been compared, value by value, and all 
instances noted in which a given person appears in two or more scales. 
(Exhibits VIII and IX of the original report give the scales
themselves.)

It was very difficult to set up natural conditions for constructing 
scales and for rating persons against them which at the same time 
would insure many instances in which, on two or more scales, the same 
scale position would be assigned to a given individual. We are for- 
tunate, in fact, to have one instance of two scales, those of officers No. 11 
and No. 24 in which the same officer appears at the same value in 8 of the 
25 places on the scale. Furthermore, No. 11 and No. 24 rated 9 persons 
in common. Hereafter I shall refer to particular rating officers by 
number. Here we have, for 9 persons, two independent ratings made 
against standards of judgment which presumably are based on like 
judgments of character. Is it not of great importance that this is the 
only instance (in 59 cases) in which I have been able to show that scale 
construction is based upon equivalent judgments of character? 

The scales of No. 11 and No. 24 for “physical qualities” and for
“intelligence” are reproduced exactly in Table VII. Note that the
names or numbers of each scale-man and the numbers of each officer
rated on both scales are located at the proper value (from 15 to 3).
In “physical qualities” the “15,” “9” and “3” men are the same on
both scales.
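
Checking where two scales use the same man at the same value is a simple lookup, sketched below for the “physical qualities” scales of No. 11 and No. 24. The entries at “15,” “9” and “3” are confirmed by the text; the entries at “12” and “6” are taken from Table VII as best it can be read and should be treated as illustrative. The function name is invented here.

```python
# Scale-men on the "physical qualities" scale, keyed by scale value.
# Entries at 12 and 6 are read from Table VII and are illustrative only.
scale_no_11 = {15: "Whitcomb", 12: "No. 23", 9: "Whiting", 6: "No. 32", 3: "Shippen"}
scale_no_24 = {15: "Whitcomb", 12: "Swift", 9: "Whiting", 6: "Rundle", 3: "Shippen"}


def shared_placements(scale_a, scale_b):
    """Scale values at which both scales use the same man, highest first."""
    return sorted(
        (value for value, man in scale_a.items() if scale_b.get(value) == man),
        reverse=True,
    )


same_values = shared_placements(scale_no_11, scale_no_24)
print(same_values)
```

Repeating the lookup for each of the five groups of qualities and summing the matches would give the count of common placements between two complete scales.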
TABLE VII.—COMPARISON OF “PHYSICAL” AND “INTELLIGENCE” SCALES
CONSTRUCTED BY NO. 11 AND NO. 24 TOGETHER WITH THEIR RATINGS ON SAME
OFFICERS

Psycho- ‘ 
logical Average | Average| 2 No, 11's No. 
test | position! of con- | = scale a y = scale — y 
No. 11 to offi- No. 24 to offi- 
stand- = on ference | > | numbers > | numbers 
ingin | others | ratings | 4 of scale cers rated by | © of scale cers rated by 
both No. lland| ¢ both No. 11 and 
scale scales | on him officers officers 
fifths and No. 24 No. 24 
The Physical Qualities Scale 
} 
15 | Whitcomb | No. 36 15 | Whiteomb 
No. 36 
| No. 19, 17, 8 
1 7.8 | 63.0 12 No. 23 | No. 19, 17, 12,| 12 | Swift No. 37 
8, 37 
No. 12 
9 | Whiting | No. 22 9 | Whiting 
| No. 22 
2 8.3 63.0 6 No.32 —= Staker 6 | Rundle 
Staker 
3 | Shippen | No. 21 3 | Shippen No. 21 
The Intelligence Scale 
| 15 Swift | No. 36 |15|No.11 No. 36 
*5 10.5 64.5 | | 
| | | No. 19, 17, 
| | | Staker 
5 10.8 82.0 12 No. 17 No. 19, 17, 37,| 12 | No. 1 
12.0 63.9 Staker No. 8 
No. 12 
3 9.0 67.0  9| No.8 No. 12, 8 | 9 | Goodnow 
|. No. 22, 37 
11.0 55.6 6 No. 7 No. 22, 21 6 | No.7 No. 21 
° 11.0 | 55.6 
3 4.3 61.8 3 No. 4 3 
*3 4.3 | 51.8 | 


* No. 11, 1, 7, and 4 on No. 24’s scale. 


On the “intelligence” scale the “6” and “3” men are identical. 
On the Leadership scale the “9” man is the same on the two scales. 
On Personal Qualities the “15” man is the same. And, finally, for 
“General Value” the “16” man is the same on the two scales. These 
eight names are common to the two scales and are assigned to the same 
scale-value. Here we have a case of two scales in which the Physical 
and Intelligence qualities may be regarded as representing about the 
same “spread” or differentiation of the traits in question and in which 
a unit interval on one scale must represent closely the same difference 
in amount of the trait as a unit interval on the other. 

Table VII reproduces the ratings on each quality and the total ratings 
given each of 9 officers on these two scales. Note that the differences 
in total rating are, respectively, 4, 3, 14, 4, 4, 4, 1, 9, 2, and the average 
difference for the 9 ratings is 5 points. The large divergence of 14 in 
one case is caused by a difference in rating on General Value in the 
case of No. 21, No. 11 having rated him 8 on General Value while No. 
24 rated him 23. The other large disagreement in total rating, that of 
9 points, occurs on Personal Qualities, the largest difference in a single 
quality being caused by one officer being rated 12 by No. 11 and 7 by 
No. 24. The physical scale is the only one that we are reasonably confident 
represents equivalent distribution of the trait and it is on this that there is 
almost perfect agreement in rating the 9 men. The differences between 
the physical ratings given these 9 officers by Nos. 11 and 24 are, 
respectively, 2, 1, 0, 1, 2, 1, 1, 0, 1. Furthermore, they are rating men 
who are distributed throughout all portions of the total scale. It is 
significant that the two ratings on a single man agree, irrespective of the 
amount of the trait that he represents. 
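
The averages implied by the differences quoted above can be checked directly. The following is a minimal sketch using only the figures listed in the text; the variable and function names are illustrative and do not come from the original report.

```python
# Differences between the ratings given by No. 11 and No. 24 to the
# 9 officers they rated in common, as listed in the text.
total_differences = [4, 3, 14, 4, 4, 4, 1, 9, 2]    # total rating
physical_differences = [2, 1, 0, 1, 2, 1, 1, 0, 1]  # physical qualities only


def mean_difference(differences):
    """Arithmetic mean of a list of absolute rating differences."""
    return sum(differences) / len(differences)


print(mean_difference(total_differences))     # 5.0
print(mean_difference(physical_differences))  # 1.0
```

The average disagreement on the physical scale is thus about one point, against five points on the total rating.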

The same almost perfect agreement in rating men on these scales is 
found when we study the intelligence qualities. In only one of the 9 
cases is the divergence in rating intelligence more than 2 points. (The 
case of No. 37, a Major, rated on a Captain’s scale. It is worth noting 
that in rating Staker, another Major, the agreement on him by the two 
raters is within 1 point in 4 of the 5 qualities and within 2 points on 
General Value.) 

The study of such comparable scales shows that it is possible, but 
extraordinarily difficult, for two rating officers to construct scales, the 
intervals of which will represent approximately equal differences and 
to make the man-to-man comparison that is necessary in supplying a 
total judgment of a person’s worth. In this analysis, the case that we 
have just discussed is the only one which has been found in which the 
construction of scales and ratings made upon them leads to a satisfactorily 
close agreement in the estimation of character. This instance, by being 
an exception to the rule, throws into sharp relief the fact that the esti- 
mation of character involves very striking difficulties. In our con- 
sideration of further examples, however, we should hold in mind the 
fact that two officers have constructed scales and estimated the 
presence or absence of complex traits on 9 other men with very close 
agreement. We should also remember that equal total ratings can be 
interpreted to be equivalent estimates of character only when the two 
scales against which ratings are made represent closely identical 
scale-values and “spread” of character. 

(In the January issue further detailed examples will be given of 
scale construction and of rating, together with a more extended inter- 
pretation of the data.) 
RATE OF MENTAL GROWTH, AGES NINE TO 
FIFTEEN 


FOWLER D. BROOKS 
Johns Hopkins University 


The results of annual re-tests of one hundred seventy-one children, 
ages nine to fifteen, of various economic and social groups, (as to 
mental ability, more highly selected at the earlier ages than at the 
later ones) give evidence that the rate of mental growth, as measured 
annually, is very nearly a straight-line affair, and is approximately 
the same for each year for school population of these ages. 

In May, 1918, 1919, and 1920, at the Training School of the Mankato, 
Minnesota, State Teachers College, a battery of the following tests¹ 
was given to grades IV to IX, inclusive: 

Four Woodworth-Wells number-checking tests; quality of hand- 
writing on two specimens of written work handed in by the pupils in 
other subjects such as language, history, etc.; quality of handwriting 
and speed of handwriting from two tests given a week to ten days 
apart; sixty words from columns Q, S, and U of the Ayres scale, dictated 
slowly in easy sentences, during three or four spelling periods 
on as many days; Thorndike Reading A2 and B, the x and y lists 
being given in alternate years, and only comparable parts of tests 
being used so that perfect score would be the same each year; Courtis 
Arithmetic, Form B, four fundamental operations, attempts and rights; 
Woody Arithmetic, Series A, four fundamental operations; Stone 
Reasoning; Composition; Woodworth-Wells Opposites; Pintner-Toops 
Revised Directions Test; Immediate Auditory Memory, Concrete and 
Abstract, Whipple’s lists; Memory for English equivalents of Italian 
words—nine different tests devised by the writer and three given 
each year; Woodworth-Wells Substitution, five geometrical forms; 
Letter-Digit Substitution, devised by the writer and the same three 
tests given each year; a reasoning test, part of an Omnibus test 
devised by Thorndike and McCall; Trabue’s Language Completion, 
scales C, B, and D—one each year; Thorndike Reading, Alpha 2. 
In 1920, Army Alpha and Thorndike Group Intelligence Test III, 
Series L, were given. 


1 A complete account of this investigation is contained in a recent publication 
of the Bureau of Publications, Teachers College, N. Y., entitled “Changes 
in Mental Traits with Age, Determined by annual Re-Tests.” 
One hundred four of the subjects were given two annual tests; 
sixty-seven were tested annually three times. 

Treatment of Data.—Each child’s score in each test was subtracted 
from his score in the same test the following year. This difference 
represents his improvement in one year and is positive or negative. 
For those tested three years the same procedure was followed, taking 
the difference between the first and second, and second and third 
testings. These improvements in each test have been grouped 
according to the ages of the children. May 15th was the median 
date of testing each year, and so ages have been computed as of that 
date. The age given in all cases is the child’s age on his last birthday. 
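
The differencing and age-grouping procedure just described can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the names `age_last_birthday` and `yearly_gains`, and the scores are invented, and the gain is labelled by the age interval beginning at the earlier testing, which is one reasonable reading of the text.

```python
from collections import defaultdict
from datetime import date


def age_last_birthday(birth: date, on: date) -> int:
    """Whole years completed on the given date."""
    years = on.year - birth.year
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1
    return years


def yearly_gains(records):
    """Group one-year score differences by age interval.

    `records` maps a child id to (birth date, {test year: score}).
    Ages are computed as of May 15th of the earlier test year.
    """
    gains = defaultdict(list)
    for birth, scores in records.values():
        for year in sorted(scores):
            if year + 1 in scores:
                age = age_last_birthday(birth, date(year, 5, 15))
                # Later score minus earlier score; may be negative.
                gains[(age, age + 1)].append(scores[year + 1] - scores[year])
    return dict(gains)


# Invented data: one child tested three times, one tested twice.
records = {
    "child_a": (date(1908, 3, 2), {1918: 40, 1919: 47, 1920: 52}),
    "child_b": (date(1908, 9, 20), {1919: 30, 1920: 29}),
}
print(yearly_gains(records))  # {(10, 11): [7, -1], (11, 12): [5]}
```

A child tested three times contributes two differences, one to each of two adjacent age intervals, as in the procedure described for the sixty-seven children tested three years.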

In each test the median yearly improvement in gross score for 
each age and for each sex has been computed. On account of the 
unreliability due to the small number of cases and to some of the 
tests themselves, no comparisons have been made between the results 
of the individual or separate tests. We can secure greater reliability 
by combining the test-results into three or four groups of similar 
functions. A difficulty in combining or comparing results of different 
tests is the different-sized units. For example, is an improvement of 
18.4 problems in Woody Arithmetic more or less than an improvement 
of 6.2 problems in Courtis Form B? Norsworthy, Thorndike, Wood- 
worth, Sleight, and others have used the procedure of expressing 
gross score points in different tests as functions of some measure of 
the variability of the respective tests. This tends to equalize the 
units. We have divided the median gains in gross score in each test 
by the arithmetic mean of the standard deviations of eleven, twelve, 
and thirteen year olds for each sex in that test. 
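
The conversion of gross-score gains into comparable units can be sketched as below. The function name and the sample figures are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form of the standard deviation was computed.

```python
from statistics import mean, median, pstdev


def sigma_gain(gains, scores_by_age):
    """Median gain expressed in units of the mean age-group S.D.

    `gains` holds the yearly gross-score gains for one test, age and sex.
    `scores_by_age` maps ages 11, 12 and 13 to the gross scores of that
    sex on the same test.
    """
    mean_sd = mean(pstdev(scores) for scores in scores_by_age.values())
    return median(gains) / mean_sd


# Invented scores chosen so that the three S.D.'s are 2, 3 and 4.
scores_by_age = {11: [8, 12], 12: [10, 16], 13: [11, 19]}
gains = [1.5, 1.8, 2.4]
print(round(1000 * sigma_gain(gains, scores_by_age)))  # thousandths of sigma
```

Because every test is divided by its own measure of variability, a gain of 18.4 problems on one test and 6.2 on another can be compared on a single scale.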

These median sigma gains in the separate tests could now be com- 
pared, but it is doubtful if any conclusions so drawn would be of 
much value. For example, in the handwriting in ordinary written 
work the boys made much greater gains from thirteen to fourteen, 
than from twelve to thirteen, while in the handwriting tests their 
improvement was at a constant rate for these two years. One could 
attempt to explain this by saying that the improvement in handwrit- 
ing as measured by the tests became so permanent by thirteen that it 
carried over into ordinary written work to a greater extent than before 
this. It seems wise, however, not to lose sight of more probable 
causes—the unreliability of single test results. Accordingly, we have 
combined the median sigma gains in the different tests into four groups 
of similar functions, by finding their arithmetic means. The four 
groups of functions we may designate simpler, memory, higher or 
more complex, and informational functions. We have also found the 
arithmetic means of the median sigma gains of all the tests, giving us 
a composite of more than forty different tests. These are given in 
Table I. 


TABLE I.—Mean Gains in Simpler, Memory, Higher, Informational, and 
Combined Functions, Expressed as Thousandths of the Mean 
Standard Deviation of Ages Eleven, Twelve and 
Thirteen for Each Sex 

Age   | Simpler     | Memory    | Higher     | Informational | Combined 
      | B    | G    | B   | G   | B    | G   | B    | G      | B   | G 
9-10  | 615  | 745  | 770 | 802 | 1013 | 951 | 1034 | 806    | 872 | 803 
10-11 | 714  | 607  | 580 | 509 | 845  | 760 | 875  | 675    | 773 | 647 
11-12 | 705  | 419  | 532 | 428 | 726  | 650 | 507  | 573    | 643 | 561 
12-13 | 640  | 554  | 294 | 454 | 675  | 589 | 434  | 467    | 570 | 542 
13-14 | 1019 | 692  | 095 | 479 | 678  | 683 | 502  | 403    | 647 | 596 
14-15 | 539  | 625  | 269 | 392 | 595  | 542 | 439  | 407    | 489 | 487 


TABLE II.—Q’s (Expressed as Thousandths of the Mean S. D.’s) of the 
Yearly Gains for Each Age and Sex in Each Group of Similar 
Functions, and in the Combined Group 

Age   | Simpler   | Memory     | Higher    | Informational | Combined 
      | B   | G   | B   | G    | B   | G   | B   | G       | B   | G 
9-10  | 465 | 503 | 863 | 1187 | 398 | 456 | 413 | 333     | 485 | 545 
10-11 | 509 | 587 | 681 | 652  | 506 | 426 | 324 | 337     | 487 | 470 
11-12 | 500 | 340 | 523 | 645  | 402 | 374 | 391 | 306     | 437 | 391 
12-13 | 448 | 357 | 742 | 529  | 452 | 449 | 314 | 290     | 460 | 403 
13-14 | 519 | 407 | 520 | 603  | 523 | 471 | 302 | 278     | 466 | 430 
14-15 | 548 | 375 | 434 | 583  | 629 | 532 | 311 | 384     | 504 | 471 


Figures 1 to 4 show the rates of gain from year to year. Since the 
average gain per year is very nearly 0.6σ, we have used the same 
distance to represent 0.6σ in the vertical scale as is used to represent 
one year in the horizontal scale. Figure 5 shows the rate of improvement 
for all tests combined, and also shows the 25 percentile and 75 
percentile curves for the combination of all tests. 
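
The figure of about 0.6σ per year can be checked against the combined-function column of Table I. The sketch below takes a plain unweighted mean over the six age intervals and both sexes; that choice of averaging is an assumption, as the text does not say how the figure was reached.

```python
# Combined-function column of Table I, in thousandths of the mean S.D.
combined_boys = [872, 773, 643, 570, 647, 489]
combined_girls = [803, 647, 561, 542, 596, 487]

all_gains = combined_boys + combined_girls
average_sigma = sum(all_gains) / len(all_gains) / 1000
print(round(average_sigma, 3))  # 0.636
```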

Results.—In the case of the simpler functions tested by number-checking 
and handwriting tests, Table I and Fig. 1 show but two 
departures from a constant rate of growth—the drop in the girls’ curve 
at eleven to twelve, and the sudden rise in the boys’ curve at thirteen 
to fourteen. In memory functions (Fig. 2) there is for boys a sudden 
drop at thirteen to fourteen. In memory, higher and informational 
functions, and in the composite of all tests the curves show less gain 
at the later ages than at the earlier ages. 

The irregularities of the curves (Figs. 1 and 2) are probably best 
explained in terms of the reliability, rather than by such theories 
as pre-adolescent slowing down, or adolescent spurt; in fact, adolescent 
spurt is sorely taxed to explain the boys’ curves at thirteen to fourteen 
in simpler and memory functions and at the same time explain their 
curves in higher and informational functions; especially since the 
groups tested were exactly the same in both cases. 

When we combine the results of more tests we get a smoothing of 
the curves. When we take the curves for each of the tests we find 
great irregularity. In another connection¹ we have shown that 
increasing the number of tests and increasing the number of cases has 
the effect of smoothing the curves. But, other things being equal, 
this increase in the number of tests, or in number of cases, is the means 
of securing more adequate measures of the functions for any age 
group and increases the reliability. It seems wise, therefore, to regard 
the irregularities in the curves as probably of chance occurrence. 

In considering the apparent decrease in rate of growth in memory, 
higher, and informational functions, the following facts must be kept 
clearly in mind: 

No nine or ten-year-olds below the advanced fourth grade were 
tested. No children beyond the third year of the junior high school 
(grade IX) were tested. These limits cut off younger children of 
less ability and older children of more ability. Dewey, Child, and 
Ruml,² making a careful selection from three thousand New York City 


1 Brooks, Fowler D.: “Changes in Mental Traits with Age, Determined by Individual 
Re-tests,” Chap. VI. Here are presented data from other investigations, 
covering fourteen thousand children of ages nine to fifteen, and four thousand of 
ages sixteen to eighteen. 

2 Dewey, Child, and Ruml: “Methods and Results of Testing School Children,” 
1920. 
public school children, so as to secure an unselected school population, 
found that 34 per cent of the nine-year-old boys and 34 per cent of the 
nine-year-old girls were below grade IV, while 38 per cent to 46 per cent 
were in the beginning IV. Of the ten-year-old boys 16 per cent were 
in or below the beginning grade IV, while 26 per cent of the ten-year- 
old girls were similarly classified. Since we did no testing below 
the advanced IV, and since the testing was done at the end of the 
school year, it is seen that the nine- and ten-year-olds represent a 
more highly selected group than those of later ages. This, of course, 
is upon the basis of grade reached in a carefully graded and carefully 
supervised school as an index of ability. 

Under these circumstances we would expect greater gains in the 
earlier years in those functions which we think of as being more 
closely associated with intellectual ability. The curves for memory, 
higher and informational functions seem to show this, or rather, are 
to be interpreted with this fact in mind. Without doubt a few children 
of superior ability were not tested at the later ages, having already 
completed the ninth grade. 

Then too, quite a number of the older children made such high 
scores at their first testing that at later re-tests there was no opportu- 
nity to show much gain in gross score in some of the tests (e.g., a twelve- 
year-old spelling fifty-seven of a sixty word test, could not show the 
same improvement in gross score during the two following years as 
could a ten-year-old who spelled twenty-six words at his initial testing; 
nor could a thirteen-year-old, having an initial score of 182 out of a 
possible 190 in reading vocabulary gain as much as a ten-year-old 
scoring 128). There are enough of such cases to account for much of 
the decrease shown in rate in several of the tests. Then, too, to make 
a gain of two points from a high initial gross score probably represents 
greater absolute improvement than the same gain in points from a 
lower initial score, yet we have no way of allowing for this fact in 
making up Table I. These two considerations (such high initial 
scores as allow no chance for much improvement in gross score, and 
the significance of small increases in gross scores from high initial 
scores) account for much of the decrease shown at the later ages, and, 
when considered in connection with the effect of selection at the earlier 
ages, indicate that the rate of mental growth is probably very nearly 
constant from nine to fifteen for school population. 
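
The ceiling on gross-score gains can be made concrete with the cases cited above. This is a small illustrative sketch; the function name `headroom` is ours.

```python
def headroom(initial_score: int, perfect_score: int) -> int:
    """Largest gain in gross score still possible after the initial testing."""
    return perfect_score - initial_score


# Cases cited in the text.
print(headroom(57, 60))    # 3   twelve-year-old, sixty-word spelling test
print(headroom(26, 60))    # 34  ten-year-old, same test
print(headroom(182, 190))  # 8   thirteen-year-old, reading vocabulary
print(headroom(128, 190))  # 62  ten-year-old, reading vocabulary
```

The older child in each pair has only a small fraction of the younger child's room for improvement, whatever his actual growth.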

To analyze the data still further we have combined the gains into 
groups of functions upon three other bases of classification: (1) as to 
presence or absence of very high scores; (2) as to influence of school 
instruction; (3) as to ability required to make an initial score. The 
curves for these groupings show the following resemblances: higher 
functions with (a) tests in which very high scores were absent, (b) 
tests much influenced by school instruction, and (c) tests in which 
making an initial score requires much ability; informational functions 
with (a) tests in which some perfect or very high scores were made, 
(b) tests little influenced by school instruction, and (c) tests in which 
making an initial score requires comparatively little ability. Further- 
more, these two sets of curves are not at all dissimilar, the classification 
upon basis of presence or absence of very high scores giving the only 
curves that show the same differences as Figs. 3 and 4 in the amount 
of decrease in rate at the later ages. 

Significant sex differences in the rate of mental growth are not 
shown. Combining results for boys and girls of the same ages, we 
have twenty-four to forty-five cases of each age group. 

Space permits but very brief mention of the light our data throw 
upon the doctrine of compensation. The correlations between the 
gains in the different groups of functions have been calculated by the 
Pearson product-moment formula, age and sex having been equalized. 
These coefficients, uncorrected for attenuation, are positive but low, 
varying from 0.04 to 0.31. These indicate that gain in one group of 
functions is not accompanied by loss or decrease in another group, as 
has often been stated. 
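
For readers who wish to see the computation, a product-moment coefficient between gains in two groups of functions may be sketched as follows. The data are invented, the function name is ours, and the step by which age and sex were equalized is not reproduced, since the text does not describe it.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Invented sigma gains for six children in two groups of functions.
memory_gains = [0.2, 0.5, 0.9, 0.4, 0.7, 0.3]
higher_gains = [0.6, 0.4, 0.7, 0.8, 0.5, 0.6]
r = pearson_r(memory_gains, higher_gains)
print(round(r, 2))
```

Under the doctrine of compensation one would expect such coefficients to be negative; low positive values, as reported above, point the other way.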

It is further seen that mental growth is here found to be at a 
regular rate from year to year, and not characterized by sudden spurts. 
Of course, it may well be contended that such spurts do occur but that 
the annual re-tests do not show them; that, if we had a great number 
of reliable individual mental growth curves—curves showing growth 
during a period of years by three- or four-month intervals—we might 
find spurts. This contention involves at least the three following 
considerations: 

First.—We must not overlook the fact that the doctrine of spurts 
is based, not upon extensive experimental evidence, but upon observation 
and reasoning by analogy from physical development, and that experimental 
evidence seems to indicate regularity rather than irregularity. 

Second.—We do not now have, and in the present state of development 
of educational tests and scales it is doubtful if we can secure, 
such reliable individual mental growth curves as are needed to settle 
this question. To do so we need tests of a very high degree of reliability; 
for many functions we need enough alternates of demonstrated 
equivalence that two tests may be given at any one time (say one day 
to one week apart) and three or four testings per year may be made 
without introducing the practice effect that would come from using the 
same form of a test three or more times a year. We need, also, to be 
sure that all who are being tested have a certain familiarity with test 
procedures, so that the improvement due to becoming more used to 
tests, may be reduced to a minimum. These are special considera- 
tions in addition to the usual standard procedures. 

Third.—The re-test method may be called in question on account of 
the limitations due to our present lack of an adequate number of 
reliable tests. It may be urged that this deficiency may be overcome 
if, instead of the re-test method, we test many thousands of each age, 
grouping them by quarter-years of age. Bureaus of research that 
have a broad program of measurement can supply much valuable 
data of this kind. Such bureaus, however, by using suitable individual 
record cards, and by following a consistent program of testing from 
year to year, will also have valuable data for such studies of individual 
mental growth as our present (and of course, future) tests and scales 
enable us to make. The re-test method, despite the limitations placed 
upon it by our present tests and scales, has certain well-recognized, 
fundamental advantages, long ago pointed out by Thorndike. 

Conclusions.—1. We have seen that the re-tests of this group of 
one hundred seventy-one children, ages nine to fifteen, by a large 
battery of tests, show mental growth to be at a rate probably very 
nearly constant from year to year. The variations from straight- 
line development at a constant rate are probably best interpreted as 
due to selection at the earlier ages, to the presence of numerous high 
scores at the later ages, and to the small number of cases. 

2. No significant sex differences in rate of mental development are 
found. 

3. Regularity, rather than irregularity, seems to characterize 
mental development as measured at yearly intervals. 

4. Increase in ability in one group of functions does not seem to 
be accompanied by a decrease in ability in some other group. 

5. Wherever careful testing is done, individual cumulative records 
of test results should be kept. Such procedure, if properly stand- 
ardized, will give an increasingly large amount of very valuable data 
on the problem of mental development of school children, and will give 
it with a minimum of extra expenditure of time and money. 
VERBAL AND ABSTRACT ELEMENTS IN 
INTELLIGENCE EXAMINATIONS 


JOHN P. HERRING 


Bureau of Educational Research, Bloomsburg State Normal School, Bloomsburg, 
Pennsylvania 


This is a study of relations existing between human intelligence on 
the one hand and certain definite abilities on the other. These definite 
abilities are comprised in two somewhat distinct scales; first, a scale of 
situations extending from very concrete to very abstract; and second, a 
scale of situations extending from very non-verbal to very verbal. 

The conclusions of such a study as this should relate both to pure 
and to applied psychology. In human nature, original and acquired 
taken together, what relations exist among the abilities named? 
And in practice, for the purposes of prediction and of control, what 
relations? How may we best estimate intelligence—through samplings 
of concrete or of abstract work—non-verbal or verbal? 

We will call the scales C for concrete and N for non-verbal. 

Scale C involves two concepts: concrete and abstract. Concrete 
is taken to mean: pertaining to things or objects; having no, or little, 
reference to the investigation of relations between objects; not prob- 
lematical in character. Abstract is taken to mean: pertaining to 
relations between objects; involving higher intellectual processes, 
particularly analytic and synthetic thinking; problematical in charac- 
ter. The distinction is not one between working with and working 
without material objects; for the same material things may enter 
equally into responses that are most widely separated in this scale. 
The Stenquist Mechanical Tests probably involve far less of the con- 
crete and more of the abstract than is obvious. 

As illustrative of concrete situations we may take the following: 
(1) Threading the mazes of the Army Beta test. More difficult and 
complicated mazes would be more abstract, more problematical; 
(2) Classifying blue squares and red triangles as in the Dearborn 
Intelligence Examination. Tests differing from these, not by involv- 
ing closer discriminations of the same sort but by involving similar 
and more complicated discriminations, would be more abstract; (3) 
Shingling a roof after the process is well mechanized. Learning to 
shingle a roof by experimentation without instruction would be more 
problematical; (4) Street-sweeping. 

As further illustration of abstract situations we may take these: 
filling a hard completion test; solving hard analogies; reading difficult 
scientific literature; planning and supervising the construction of 
large bridges; conducting the business affairs of Standard Oil. 

Scale N also involves two concepts, non-verbal and verbal. Verbal 
is taken to mean: involving the use of words and other symbols 
such as numbers, mathematical signs of equality and inequality or 
any subjective imagery taken to represent and to retain during 
further study earlier outcomes of investigation. Non-verbal is taken 
to mean: not involving the use of symbols; pertaining to experiences 
in which the subject deals immediately with its object; not involving 
representative symbolism. 

Verbal or symbolic situations may be illustrated, obviously enough, 
by verbal completion and analogies tests and arithmetical problems, 
and somewhat less obviously by teachers’ estimates of children’s 
intelligence; non-verbal, non-symbolic situations by writing S between 
two numbers or symbols that are the same, and D between two that 
differ; by threading mazes; and by learning digit symbol combinations. 

It is assumed for these two scales: first, that it is false to classify 
human behavior dualistically as necessarily either concrete or ab- 
stract; necessarily either non-verbal or verbal; and second that human 
behavior may be described in these two phases as comprising re- 
sponses, the distribution of which is continuous throughout both scale 
C and scale N, some responses being, for instance, extremely concrete; 
some just a little less concrete, and so on, without gap, throughout the 
gamut to extremely abstract. These assumptions are roughly 
borne out by a number of subjective judgments which will be 
presented. 

The definitions of the concepts abstract, concrete, and verbal are not 
regarded as final or even as necessarily valid, but much rather as 
starting points for the work of classifying the samplings of behavior 
elicited by a number of group examinations. These examinations 
were of two classes, selected for two distinct purposes: first, a group, 
seven in number, employed as a weighted composite criterion of 
intelligence; and second, a group, five in number, employed as the 
means of measuring concrete-abstract abilities and non-verbal— 
verbal abilities. The first group, the criterion of intelligence, com- 
prised the following with weights as stated: First, the Stanford Revision 
of the Binet Simon Tests, weight 3; second, educational age found by 
averaging the results of the Thorndike Reading Test Alpha 2, the 
Woody-McCall Arithmetic Test, the Monroe Arithmetical Reasoning 
Test, and an Ayres Buckingham Spelling Test, weight 3; third, a 
measure of intelligence based upon the age and grade reached at the 
time of the study, weight 3; fourth, the National Intelligence Tests, 
Forms A and B, weight 2; fifth, the Thorndike Reading Test Alpha 2, 
weight 1; sixth, teachers’ estimates of intelligence, weight 1; and 
seventh, the Kelley Trabue Completion Test Alpha, weight 1. 
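
The weighting just stated can be sketched as a weighted mean. Only the seven weights come from the text; the dictionary keys, the function name, and the assumption that each component has first been reduced to comparable standard scores are ours, since the method of combination is not described.

```python
# Weights stated in the text for the seven components of the criterion.
WEIGHTS = {
    "stanford_binet": 3,
    "educational_age": 3,
    "age_grade_measure": 3,
    "national_intelligence_tests": 2,
    "thorndike_reading_alpha_2": 1,
    "teachers_estimates": 1,
    "kelley_trabue_completion": 1,
}


def composite_criterion(standard_scores):
    """Weighted mean of one child's component scores.

    Assumes (hypothetically) that all components are already expressed
    in comparable units, such as standard scores.
    """
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[name] * standard_scores[name] for name in WEIGHTS)
    return weighted_sum / total_weight


# Invented child scoring half a standard deviation above the mean throughout.
child = {name: 0.5 for name in WEIGHTS}
print(composite_criterion(child))  # 0.5
```

The weights sum to 14, so the first three components together carry nine-fourteenths of the criterion.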

This weighted composite criterion of intelligence comprises a very 
wide range of activities, involves several modes of observation obtained 
upon a number of different and widely separated days, and exhibits 
self-correlations between 0.9 and 0.8 in age groups. The validity of 
this criterion is assumed. It is just the sort of criterion psychological 
investigators everywhere assume. Any who do not accept its validity, 
may, on that account, differ with the conclusions offered. 

The second group, used as a measure of the dependent variables, 
C and N, comprised the following: 


The Dearborn Group Intelligence Examination 
The Army Beta Examination 
The Pressey Primer Mental Survey 
The Indiana Cross-cut Tests 
The Thorndike Visual Vocabulary 


All the examinations of both groups were divided into short 
units, each about a page in length, assumed to be homogeneous with 
respect both to scale C and to scale N. These units were bound into a 
book of 63 pages, one page per unit, which was submitted to five 
judges, who classified its content by assigning each unit a position from 
1 to 7 upon both scales. These judgments combined two elements: 
first, an averaging of opinion concerning the content of the concepts 
themselves (abstract, concrete, verbal and non-verbal), and second, 
an averaging of opinion concerning the content of the situations repre- 
sented in the book of units. It was important to give each judge a 
voice in the determination of the meaning of the concepts as well as in 
the classification of the test material. Relatively little stress is laid, 
therefore, upon the rather formal definitions already presented. To 
one asking for better definition would be given the average findings 
of the judges themselves. The term “abstract” comes now to mean— 
having the character of situations like those marked at or above 5 in 
this book of units. The uniformity of judgment is indicated by the 
average correlation, judge with judge, of 0.542 ± 0.019. 
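
An average correlation of judge with judge can be sketched as the mean of the coefficients for every pair of judges; with five judges there are ten such pairs. The ratings below are invented, the function names are ours, and averaging the raw coefficients is an assumption about how the reported figure was obtained.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def mean_pairwise_r(ratings_by_judge):
    """Mean correlation over all pairs of judges."""
    pairs = list(combinations(ratings_by_judge, 2))
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)


# Invented positions (1 to 7) assigned by five judges to six units.
ratings = [
    [1, 2, 4, 5, 6, 7],
    [2, 2, 3, 5, 7, 6],
    [1, 3, 3, 4, 6, 7],
    [2, 1, 5, 4, 5, 7],
    [1, 2, 3, 6, 6, 6],
]
print(round(mean_pairwise_r(ratings), 3))
```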
A result of this classification is the existence of two subjective 
scales, by comparison with which any homogeneous portion of test 
material or of behavior, may be rated for abstractness and for sym- 
bolism. Among the less obvious comparisons which are thus made 
possible are these: 

1. Digit symbol tests are distinctly concrete and very non- 
symbolic. The “symbols” in such tests are the things themselves, 
not symbols of other things. 

2. The Army Beta Maze Test is as concrete as anything found, 
and (expectedly) as non-verbal as it is concrete. 

3. The Beta Cube-counting Test is distinctly an abstract one and, 
as would be expected, is very non-verbal. 

4. The Woody McCall Arithmetic Test is rated 5 in abstractness 
and about 314 in verbal quality. 

5. Arithmetical problems are rated very abstract. 

6. The Stanford Binet Tests average as a whole rather abstract 
than concrete, rather verbal than non-verbal, but they do not occupy 
extreme position in either respect. In both scales which extend poten- 
tially from 1 to 7 they are rated about 5, leaving in both series about 
15 out of 63 samples above. 

7. The position of the composite criterion of intelligence is just 
about 5, like the Stanford Binet, upon both scales. 

8. The scale of non-verbal—verbal quality runs from the Beta 
Maze Test at one extreme to the Thorndike Reading Test Alpha 2 
at the other. The scale of concreteness—abstractness extends from 
the Beta Maze at one end to the Analogies of the National Intelli- 
gence Tests at the other. 

In the field of applied psychology, for the purposes of prediction
and control, the following results and conclusions may be tentatively
stated.

The raw correlations include those of four grades in a public 
school and those of three age groups. These are grades IV, V, VI, 
and VII. They have membership ranging from 23 to 41 per grade 
and totalling 118. They are extraordinarily homogeneous with 
respect to mental age, the standard deviations averaging only 4 
months, as against perhaps three or four times that spread in many 
public school grades. This reduction of variation operates to reduce 
correlations but leaves them comparable. Averaging the several 
grades, we have correlations as follows: 
Intelligence:
With the middle portion of scale 0.36 
0.31 
0.54 


Similar correlations for age groups consisting of averages of ten 
year olds, eleven year olds and twelve year olds, are as follows: 
Intelligence: with 


The P.E.’s of the correlations entering into all these averages 
range from 0.056 to 0.130 and average 0.10. The average r’s them- 
selves are of course much more reliable. Further, these correlations 
vary almost as do the S.D.’s of the mental ages; the larger the S.D.,
the higher the r. So close is the correspondence that the average 
correlation of the r’s with the S.D.’s is 0.906 for raw and 0.886 for 
corrected coefficients. Increase in the S.D.’s involved in a correlation
results in an increase of correlation which is paralleled by no fact or 
change in human nature; it is a function of increase in the variability of 
the group measured. With regard to the correlation existing between 
the two traits, it is wholly spurious. A correlation of 0.4 in a group 
having a standard variability of 5 months of mental age may mean just 
the same degree of mutual implication as a correlation of 0.8 in a group 
having a much higher standard variability. It appears, therefore, 
that these correlations have a reliability much greater than the custom- 
ary statistical formula of probable error reveals. 
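The dependence of r on the spread of the group can be illustrated with the usual adjustment for restriction of range. That adjustment is not given in the text; it is brought in here as an assumption (linear regression, equal scatter about the regression line), and the function name is hypothetical. A minimal sketch:

```python
from math import sqrt

def r_for_changed_spread(r, sd_ratio):
    """Correlation expected when the S.D. of one trait is multiplied by
    sd_ratio, assuming linear regression and equal scatter about the line.
    This is the customary range adjustment, an assumption added for
    illustration, not a computation taken from the article."""
    k = sd_ratio
    return r * k / sqrt(1.0 - r * r + r * r * k * k)

# A correlation of 0.4 in a group with an S.D. of 5 months of mental age,
# re-expressed for groups whose S.D. is 2, 3 and 4 times as large.
for k in (1, 2, 3, 4):
    print(k * 5, "months:", round(r_for_changed_spread(0.4, k), 2))
```

Under these assumptions a spread about three times as great (some 15 months) carries a correlation of 0.4 to nearly 0.8, which is the kind of equivalence the paragraph above describes.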

The correlations point in the direction of the following conclusions: 

1. Abstract and verbal tests afford better means for the prediction 
of human intelligence and the control of human situations than do 
concrete and non-verbal tests. 


2. Abstract and verbal tests will make the best material that can 
be selected from such tests as those employed in this investigation
for the purpose of inclusion in intelligence examinations. 

3. The middle mongrel portions of the two scales are to be dis- 
tinguished by means of their correlations from the concrete and non- 
verbal portions; they are in general distinguishably better material for 
intelligence examinations than N1 and C1 and very inferior to N3
and C3. 

4. These abstract and verbal tests, besides being correlated more 
highly with intelligence, have higher coefficients of reliability and are, 
for this reason, other things equal, better content for intelligence 
examinations. 

It is by no means argued that it is never proper to use concrete and 
non-verbal tests of intelligence, for there are circumstances in which 
they are the feasible form of procedure; nor that they should never be 
included in a battery of tests for literate and intelligent adults, for 
the data do not permit the conclusion that concrete and non-verbal 
tests may not properly supplement the others in a battery of tests. 

Perhaps the outstanding conclusion is the following: it seems to be 
the more purely abstract and the more purely verbal tests that afford 
the closer measures of intelligence.

We come now to examine the corrected correlations between intelli- 
gence and the different portions of these two scales. 


COEFFICIENTS OF CORRELATION, RAW AND CORRECTED FOR ATTENUATION

(M means intelligence)

Raw          Corrected


The following interpretation is suggested: 

Total human nature and the mutual demands of human beings 
have become such that intelligence, as it is required for success in 
contemporaneous human society, comprises largely the ability to deal 
effectively with situations involving the use of language and of mathe- 
matical and other symbols, both subjective and conventional, and 
also the ability to control situations through the analysis and
interpretation of novel and complicated phenomena. It is hard to think
of any important posts of social responsibility, from railway surveying 
to international law, of which this does not seem true. It is the hod- 
carriers who typically deal with concrete situations. The master 
architect controls the situations but he does so through such as the 
hod-carrier, whom he reaches by the way of many and abstract 
processes involving complex symbolisms. 
WHAT IS READING ABILITY?


J. BENSON WYMAN AND MIRIAM WENDLE 
Stanford University 


Tests of reading ability have been devised—many of them. Their 
reliability coefficients have been determined and norms have been 
obtained; but do their reliability coefficients or their norms guarantee 
that they are good tests of reading ability or tests of reading ability 
at all? A reliability coefficient is only one criterion of a test. It 
measures the amount of agreement one would expect between an 
individual’s score on one form of a test and his score on any other 
comparable form of the same test; but is there anything to show that 
the so-called reading tests do measure reading ability? 

The present studies were undertaken to get at a method by which 
we could determine whether the so-called reading tests do measure 
reading ability. The questions arose: ‘‘What is reading ability? 
What criterion can be used for measuring it?” The first suggestion 
was to use teachers’ estimates of the reading ability of their pupils as 
the criterion. But, knowing the fallibility of teachers’ estimates, an 
alternative criterion was also used and the plan adopted to obtain 
it was the following: Two professors and three graduate students 
conversant with the tests used were asked to assign values to each 
test indicating the value of the test as a measure of silent reading 
ability (i.e., they gave their judgments of the values of the tests as
tests of reading, whereas for the first criterion the teachers gave their
judgments of the reading ability of the pupils). These five values
were made independently. The five values for each test were then
averaged and this average value was taken as the scoring, or weight, 
for the test as a reading test. Then this alternative criterion for 
reading ability consisted of the combination of all the tests weighted 
according to the above average values. Having then these two 
criteria for reading ability the procedure was as follows: 

Reliability Coefficients—The correlation between a given set of 
scores in one test and another set similarly obtained on a similar form 
of the same test was determined. In cases where there were not two 
comparable forms of a test, the one form was divided into two halves 
measuring substantially the same thing. These halves were correlated;
and then Brown’s formula,

$$\frac{2r}{1 + r},$$

where r was the obtained correlation, was applied; and this gave an
estimate of the correlation between two forms of the test. This
correlation is the reliability coefficient of the test.
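The split-half procedure just described (correlate the two halves, then step the result up with Brown’s formula 2r/(1 + r)) can be sketched in Python. The half scores below are hypothetical and the function names are illustrative only; they are not taken from the study.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation of two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def brown_step_up(r_half):
    """Brown's formula: reliability of the whole test from the
    correlation between its two halves."""
    return 2.0 * r_half / (1.0 + r_half)

# Hypothetical scores of eight pupils on the two halves of one test.
first_half = [7, 5, 9, 4, 6, 8, 3, 7]
second_half = [6, 5, 8, 5, 7, 7, 4, 6]

r = pearson(first_half, second_half)
print("half with half:", round(r, 3))
print("whole test    :", round(brown_step_up(r), 3))
```

The stepped-up value always exceeds a positive half-with-half correlation, which is why the halves must be comparable and independent before the formula is applied.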


Correlations.—(1) Each test was correlated with every other 
test by means of Pearson’s Product-Moment formula and the 
probable errors of these correlations were determined from 

$$\text{P.E.} = \frac{0.6745\,(1 - r^2)}{\sqrt{n}}$$

where r = correlation
      n = number of cases

2. The average of the independent estimates of the reading
ability of the pupils made by the two teachers was taken as
the first criterion of reading ability. (Call this Reading
Ability T.) The sum, or the average, of the pupils’ scores
on each test was taken as the test score. Then each test
score was correlated with Reading Ability T.

3. The second criterion for reading ability (call this
Reading Ability C) consisted of the combination of all the
tests weighted according to their values as measures of reading
ability, with the modification that, when a test was to be
correlated with it, that test was taken out of the criterion.
(Suppose, for example, that the Thorndike Reading Alpha
test were to be correlated with Reading Ability, then the
criterion for reading ability would be a combination of all the
weighted tests except Thorndike Alpha.) In order to determine
this correlation, the following formula¹ was used:

$$r_{1C} = \frac{\sum r_{1x} w_x \sigma_x}{\sqrt{\sum r_{xy} w_x w_y \sigma_x \sigma_y}}$$

where

$\sum r_{1x} w_x \sigma_x$ = the sum of the correlations (each multiplied by
its weight and standard deviation) of the type Thorndike
Alpha and Monroe, Thorndike Alpha and Completion Beta.

$\sum r_{xy} w_x w_y \sigma_x \sigma_y$ = the sum of all the intercorrelations each one
multiplied by the weights and standard deviations
of both the tests correlated.


1 Kelley, T. L.: Bulletin 27, University of Texas, 1916.

4. Spearman considered that if there were any errors in the
original scores of the pupils due to chance mistakes, they
would not compensate one another but would reduce the
correlation. He devised the following formula by the application
of which the obtained correlations would be corrected
for this, so that the corrected coefficient measures the extent
to which the test would correlate with the criterion if the
score of the individual were an accurate one both in the test
and in the criterion:

$$\text{Corrected coefficient } R = \frac{r}{\sqrt{r_{13}\, r_{24}}}$$

where
r = correlation between test and criterion
$r_{13}$ = reliability coefficient of test
$r_{24}$ = reliability coefficient of criterion

R then is the correlation between a true reading score and a
true criterion of reading ability.

5. Then that test is more uniquely a reading test and less a 
test of any other function which shows the highest R. 
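Spearman’s correction in step 4 divides the obtained correlation by the square root of the product of the two reliability coefficients. A minimal Python sketch follows. The three figures used (a raw correlation of 0.49 and reliabilities of 0.63 and 0.45) are quoted elsewhere in this article for the Thorndike-McCall test and Teachers’ grades; pairing them in this way is an inference from those later passages, and the function name is hypothetical.

```python
from math import sqrt

def correct_for_attenuation(r, rel_test, rel_criterion):
    """Spearman's correction: the correlation expected between an
    error-free test score and an error-free criterion."""
    return r / sqrt(rel_test * rel_criterion)

# Assumed pairing of figures from the High School English study:
# raw r = 0.49, test reliability = 0.63, criterion reliability = 0.45.
R = correct_for_attenuation(0.49, 0.63, 0.45)
print(round(R, 2))  # 0.92
```

Because both reliabilities sit under the root, a low criterion reliability such as 0.45 enlarges the corrected value considerably, which is why the estimate of that reliability matters so much in what follows.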


Detailed Procedure.—Two studies were made. One was with grade 
VIII B pupils where reading tests were the main tests, and the criteria
were Reading Ability T and Reading Ability C. (We shall refer to this
study as “VIII grade reading study.”) The other study was with
High School pupils; English tests were used and the criterion of
English Ability (call it English Ability E) was the Teachers’ grades.
(This study we shall refer to as “High School English study.”)

Tests Given.—(1) The following tests were given to 36 pupils in 
grade VIII just before promotion. Two teachers independently 
ranked the pupils in the order of their ability to read: 

Thorndike’s Reading Scale Alpha 2. 

Monroe’s Silent Reading Test II (forms 2 and 3). 

Thorndike’s Reading Test B—Visual Vocabulary (series x and y). 

Kelley-Trabue Completion Exercise Alpha. 

Terman Group Test of Mental Ability (form A). 

Seven S Spelling Test (list 13). 

Woody-McCall Arithmetic (forms 1 and 2). 

Compositions ((1) “What I should like to do next Saturday.”
(2) “The most exciting ride I ever had.”)

2. The following tests were given to 94 pupils of the Senior class 
of a High School: 

Briggs’ English Form Test (Beta). 

Compositions (2). 

Thorndike-McCall Reading Scale (form 1). 




Kelley-Trabue Completion Exercise (Beta). 

Abbott-Trabue tests of Poetic Appreciation (series x and y). 

In this study the teachers’ grades in English (English Ability E) 
for the previous semester were then secured as an objective measure 
against which to gauge these tests as tests of ability in English. The 
Terman Group Test of Mental Ability had been given, so that the 
scores on a reliable intelligence test were at hand for comparison. 
Since it was thought the correlations of the English tests with an 
arithmetic test might also bring out interesting relations, the scores on 
the arithmetic exercise in the Terman Group Test were used. 

Reliability Coefficients—For Thorndike’s Visual Vocabulary, 
Woody-McCall Arithmetic, Monroe Silent Reading, Compositions 
and Abbott-Trabue tests one form was correlated with the second 
form. 

For Spelling, Completion, Terman Group, Opposites Beta, Terman 
Arithmetic, Thorndike-McCall and Briggs two halves were obtained 
and correlated and then Brown’s formula was applied. 

For teachers’ estimates (in the first study) one teacher’s marks 
were correlated with the other teacher’s and then Brown’s formula was 
applied. But, since the Teachers’ grades (in the High School English 
study) represented the marks of only one teacher per individual, it was 
necessary to estimate their reliability. In a study of teachers’ esti- 
mates, T. L. Kelley (‘‘ Educational Guidance,” sec. 4, page 15) found 
consistently low reliability coefficients. Teachers’ gradings are 
probably somewhat more reliable. So the reliability of Teachers’ 
Grades was estimated to be about 0.45. 

Thorndike’s Alpha 2 Test consists of passages that are to be read,
with questions on the passages to be answered. In “Difficulty 7”
there are two paragraphs in the passage with four questions to be
answered on the first paragraph and three on the second. In
“Difficulty 8” there are two paragraphs with four questions on each. In
“Difficulty 8½” there is one paragraph with four questions on it;
and in “Difficulty 9” there is one paragraph with five questions. It
will be seen then that there are two ways in which the test can be
divided into two comparable halves. The one way is to divide the
test so that unbroken paragraphs and twelve questions are in either
half. The other way is to divide the test so that there is the same
number of questions on either side but neither side has unbroken
paragraphs. The test was divided into two parts, according to the
former method, by splitting it into its paragraphs so that the errors in
the first four questions in Difficulty 7, in the second four in Difficulty 8
and in all four in Difficulty 8½ formed one part while the rest of the
errors formed the other part. These two parts were then correlated;
and the reliability coefficient was obtained by applying Brown’s
formula to this correlation. The test could be divided into two parts,
according to the second method, by taking the first half of the sum of
the errors in Difficulty 7, and the second half in each of the Difficulties
8, 8½ and 9 as one part and the remainder of the errors as the other
part, and correlating them.

Now, whether an individual can answer Question 2 depends to a
certain extent on whether he can answer Question 1, whether he can
answer Question 3 depends to a certain extent on whether he can
answer Question 2, and so on. This is the same for each paragraph.
Let us call this correlation between questions based on the same
paragraph ρ and the correlation between questions based on different
paragraphs η. If we then call the value obtained by correlating the
two parts of Alpha 2 according to the first method of division r₁ and
that obtained by the second method r₂, we can determine the value of
ρ and η from the following equations:

$$r_1 = \frac{n^2 \eta}{n + Mm(m - 1)\rho + n(n - m)\eta}$$

where m = number of terms in a group
      M = number of groups
      n = Mm
+7) 
1+ (n —1)p +m 


where n = number of terms in either half. 



By solving these equations we find 


ρ = 0.179
η = 0.060


Then ρ being greater than η proves that the correlation between
answers on a single paragraph is greater than that between answers
on different paragraphs. Correlation η is due to a certain intelligence
level (a child) acting upon certain independent tasks whereas ρ is due
to this plus a factor (operating much as a chance factor) which aids in
answering subsequent questions in a set if the first is answered
correctly and which hinders if the first is answered incorrectly.
Therefore ρ is spuriously high, due to correlation between errors, as a
measure of the correlation between independent questions. Since this is so,
r₂ is spuriously high as a reliability coefficient, because the two halves
correlated in obtaining r₂ are not composed of independent exercises.
Accordingly r₁ is the correct value for the reliability coefficient for
Alpha 2.
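The first-method correlation can be recomputed from the two values just found. The sketch below rests on assumptions not spelled out in the text: every question is given unit variance, each half is treated as M = 3 paragraphs of m = 4 questions, questions on one paragraph correlate 0.179 and questions on different paragraphs 0.060. With unbroken paragraphs every cross-half pair lies on different paragraphs, so the covariance of the halves is n² times the between-paragraph correlation. Function and argument names are hypothetical.

```python
def half_correlation_unbroken(within, between, groups, per_group):
    """Correlation between two halves, each made of `groups` unbroken
    paragraphs of `per_group` unit-variance questions.
    within  = correlation of questions on the same paragraph
    between = correlation of questions on different paragraphs"""
    n = groups * per_group
    covariance = n * n * between              # all cross-half pairs
    variance = (n                             # the n question variances
                + groups * per_group * (per_group - 1) * within
                + n * (n - per_group) * between)
    return covariance / variance

r1 = half_correlation_unbroken(0.179, 0.060, groups=3, per_group=4)
whole_test = 2 * r1 / (1 + r1)                # Brown's formula
print(round(r1, 3), round(whole_test, 2))     # 0.357 0.53
```

The stepped-up figure agrees with the reliability of 0.53 reported for Thorndike Alpha 2, which suggests the assumed grouping is close to the one actually used.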

1. The following are the reliability coefficients for each test (Grade 
VIII Reading study): 


Thorndike’s Visual Vocabulary ...................... 0.79 ± 0.04
Monroe Comprehension ............................... 0.75 ± 0.05
Woody-McCall Arithmetic ............................ 0.70 ± 0.06
Thorndike Alpha 2 .................................. 0.53 ± 0.08


These reliability coefficients are measures of reliability based on 
the same pupils. So any difference in them cannot be charged to 
differences in range of talent. Hence, as regards reliability alone, we 
can place the above tests in the order indicated by the coefficients. 
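The probable errors attached to these coefficients follow from the formula P.E. = 0.6745(1 − r²)/√n with the 36 pupils of the Grade VIII study. A short Python check, offered as an illustration (the function name is hypothetical):

```python
from math import sqrt

def probable_error(r, n):
    """Probable error of a product-moment correlation r based on n cases."""
    return 0.6745 * (1.0 - r * r) / sqrt(n)

reliabilities = [
    ("Thorndike's Visual Vocabulary", 0.79),
    ("Monroe Comprehension", 0.75),
    ("Woody-McCall Arithmetic", 0.70),
    ("Thorndike Alpha 2", 0.53),
]

# 36 pupils took the tests in the Grade VIII study.
for name, r in reliabilities:
    print(f"{name:32s} {r:.2f} ± {probable_error(r, 36):.2f}")
```

The four results round to 0.04, 0.05, 0.06 and 0.08, matching the tabled values.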

Terman Group, Teachers’ Estimates and Spelling are the most 
reliable. Thorndike’s Visual Vocabulary, Monroe Comprehension, 
Woody-McCall Arithmetic and Monroe Rate are satisfactory; but 
neither Alpha 2 nor Completion can be regarded as altogether satis- 
factory. The reliability coefficient for Composition, based on only 
two compositions, is very low. If compositions are to be used, by 
applying Brown’s formula it can be seen that in order to have a relia- 
bility coefficient comparable to the Terman (0.85) it would be necessary 
to give 17 compositions: 


$$\text{Reliability coefficient} = \frac{nr}{1 + (n - 1)r}$$

$$0.85 = \frac{n(0.25)}{1 + (n - 1)(0.25)}, \qquad n = 17$$

where
n = number of compositions
r = reliability coefficient for 2 compositions = 0.25
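The figure of 17 can be checked by solving the general form of Brown’s formula, R = nr / (1 + (n − 1)r), for n. A small Python sketch with hypothetical function names:

```python
def stepped_up_reliability(r, n):
    """Reliability of a measure lengthened n times, by Brown's formula."""
    return n * r / (1.0 + (n - 1.0) * r)

def length_needed(r, target):
    """Lengthening factor n at which reliability r rises to target,
    obtained by solving Brown's formula for n."""
    return target * (1.0 - r) / (r * (1.0 - target))

n = length_needed(0.25, 0.85)
print(round(n, 2))                                  # 17.0
print(round(stepped_up_reliability(0.25, 17), 2))   # 0.85
```

The inverse form makes clear how quickly the required length grows as the starting reliability falls.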
2. The following are the reliability coefficients for each test (High
School English study): 


                                    0.74 ± 0.03
Thorndike-McCall .................. 0.63 ± 0.04


The low reliability of the Abbott-Trabue Poetic Appreciation test
accords with the results described by Abbott and Trabue in the article
entitled “A Measure of Ability to Judge Poetry” (Teachers’ College
Record, March, 1921). Therefore it is not possible to measure
poetic appreciation by means of this test. The high reliability
coefficient of the Terman test agrees with the findings in the first
study. The reliability of Teachers’ grades, Composition and Trabue
is too low to make them satisfactory measures. (See table,
page 526.)


(a) No correlations are exceptionally high. As might be expected 
the correlations of Terman Group with Completion and Terman Group 
with Opposites are highest, since both Completion and Opposites are 
used as Intelligence tests in lieu of using Terman Group. The high 
correlation between the Terman Group test and the Terman Arithme- 
tic is partly due to its being a correlation between Terman and 
part of itself. 

(b) The highest correlations are those between Terman Group 
and certain English tests rather than between the English tests 
themselves. It may be that different aspects of English ability 
may not exist ordinarily in the same individual; but the impli- 
cation may be that the Terman Group test, because of its greater 
reliability, is a better measure of English ability than any one of the
English tests. 

(c) The correlation between Teachers’ grades (English Ability E) 
and Thorndike-McCall Reading is the highest; Teachers’ grades and 
Terman Group next; Teachers’ grade and Completion next, and then 
Teachers’ grades and Opposites Beta. But none of these correlations
is high, indicating

(a) that these tests do not measure English ability as it is
judged by the teachers, or

(b) that the reliability of teachers’ judgments is low, or

(c) that other factors than English ability or General
Intelligence affect English grades.


It seems that the Terman Group test is as indicative of English ability 
on the criterion of Teachers’ grades as the more apparently English 
tests are. 

2. Correlations, in the VIII Grade Reading study, between Reading 
Ability T and the tests. (The criterion for Reading Ability T was 
the average of the independent estimates of the pupils’ reading ability 
as made by the two teachers) : 


Reading Ability T and Terman Group ..................... +0.77 ± 0.05
                      Visual Vocabulary ................ +0.62 ± 0.07
                                                         +0.59 ± 0.07
                                                         +0.54 ± 0.08
                      Monroe Comprehension ............. +0.49 ± 0.085
                      Monroe Rate ...................... +0.10 ± 0.11


These correlations show that what teachers call “reading ability”
correlates more highly with what the Terman test measures than with 
what the so-called reading tests measure. The rate of reading, as 
measured by the Monroe tests, shows very little correlation with the 
teachers’ estimates of reading ability. Age within a grade has, as 
would be expected, a negative correlation with reading ability. 
3. Intercorrelations between tests (in the Grade VIII Reading
study):

                           Terman    Visual    Monroe    Comple-   Thorndike Compo-    Seven S   Woody-
                           Group     Vocab-    Compre-   tion      Alpha 2   sition    Spelling  McCall
                                     ulary     hension                                           Arithmetic
Visual Vocabulary .......   0.69
Monroe Comprehension ....   0.65      0.44
Kelley-Trabue Completion    0.64      0.62      0.30
Thorndike Alpha 2 .......   0.58      0.56      0.20      0.54
Composition .............   0.55      0.45      0.20      0.32      0.57
Seven S Spelling ........   0.53      0.55      0.54      0.21      0.17      0.38
Woody-McCall Arithmetic .   0.53      0.39      0.45      0.50      0.24      0.40      0.42
Monroe Rate .............   0.26      0.23      0.62     -0.10      0.07      0.01      0.80     -0.15
4. Correlations, in the VIII Grade Reading study, between Reading
Ability C and the tests:

Weighting of Tests. The following are the average values or
weights determined as described previously:

Correlations

Reading Ability C and Terman Group ........................ 0.85
                      Visual Vocabulary ................... 0.76
                      Monroe Comprehension ................ 0.53


Here the highest correlation is with the Terman Group test; and, 
as the correlation is a high one, we must conclude they measure very 
much the same thing. The question arises, is our criterion for reading
ability really a measure of general intelligence, or does the Terman
Group test measure reading ability?
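Each correlation with Reading Ability C pairs one test with a weighted combination of the remaining tests. Such a correlation can be built from the intercorrelations, weights and standard deviations alone: the weighted sum of the test's correlations with the others, divided by the square root of the weighted sum of all intercorrelations among the others (each test counted with itself at 1). The Python sketch below uses hypothetical scores and weights, not the study's data, and checks that route against correlating with the composite directly.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def sd(v):
    m = mean(v)
    return sqrt(sum((a - m) ** 2 for a in v) / len(v))

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (sd(x) * sd(y))

def corr_with_weighted_criterion(test, others, weights):
    """Correlation of one test with the weighted sum of the other tests,
    using only correlations, weights and standard deviations."""
    sds = [sd(t) for t in others]
    numerator = sum(pearson(test, t) * w * s
                    for t, w, s in zip(others, weights, sds))
    variance = sum(pearson(tx, ty) * wx * wy * sx * sy
                   for tx, wx, sx in zip(others, weights, sds)
                   for ty, wy, sy in zip(others, weights, sds))
    return numerator / sqrt(variance)

# Hypothetical scores of six pupils; `alpha` is the test left out
# of the criterion, `others` are the tests that make it up.
alpha = [12, 15, 9, 20, 14, 17]
others = [[30, 34, 25, 41, 33, 36],
          [5, 7, 6, 9, 4, 8],
          [50, 48, 40, 62, 55, 51]]
weights = [3.0, 1.5, 2.0]

by_formula = corr_with_weighted_criterion(alpha, others, weights)
composite = [sum(w * t[i] for w, t in zip(weights, others))
             for i in range(len(alpha))]
directly = pearson(alpha, composite)
print(round(by_formula, 6), round(directly, 6))
```

The two routes agree, so the criterion need never be scored pupil by pupil; the table of intercorrelations is enough.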

Comparing the correlations 2 and these correlations, we see that, 
according to either criterion, the Terman Group test ranks highest 
as a measure of reading ability, just as the Monroe Rate of Reading 
test shows least relationship. The values for the other tests vary in 
the two lists. It would seem that the second of the two criteria 
(Reading Ability C) was the more reliable measure of reading ability, 
for the teachers’ estimates of the reading ability of their pupils are 
more likely to be tempered by their knowledge of the general intelli- 
gence of the individuals. 

The best test then for reading ability, as far as our data are con- 
cerned, is the Terman. Visual Vocabulary and then Completion and 
Thorndike Alpha 2 are the next. Rate of Reading cannot be considered
a test of reading at all in so far as our criteria measure reading
ability.
5. Corrected Coefficients of Correlation: 
(a) English Ability E and Thorndike-McCall Reading ................. 0.92
                          Abbott-Trabue Poetic Appreciation ........ 0.76
                          Kelley-Trabue Completion Beta ............ 0.67
                          Terman Group Test of Mental Ability ...... 0.65
                          Opposites Beta ........................... 0.59
                          Terman Arithmetic ........................ 0.28
                          Composition (2) .......................... 0.26
                                                                     0.07
(b) Reading Ability T and Composition ...................... 1.29 ± 0.25
                          Visual Vocabulary ................ 0.83 ± 0.08
                                                             0.67 ± 0.11
                                                             0.15 ± 0.17
(c) Reading Ability C and Composition ...................... 1.14 ± 0.21
                          Visual Vocabulary ................ 0.85 ± 0.05
                                                             0.61 ± 0.09
For this group of correlations (c) where the criterion was Reading
Ability C, the probable errors given are the values when the reliability
coefficient of Reading Ability is assumed to be 1. Were it less than 1,
the corrected coefficients would be $1/\sqrt{r_{24}}$ times greater and the
probable errors would be greater.


In determining the probable errors for these corrected coefficients 
the following formula! was used: 


P.E. = 0.6745 (1 — ri3?)? + (1 — ra4?)? 


Aro,” 
(1 — ris) (2 — 2r? + ris — M13") (1 — rx) (2 — 2r? + Tu?) 
2ri3 2724 
rd — 113) (lL — 
24 


Spearman found some corrected coefficients greater than unity. 
He says, ‘‘ At most, the corrected coefficient is only the true coefficient 
plus the error due to testing a limited sample—the general magnitude 
of such an error is indicated by the so-called probable error; and though 
a true coefficient cannot exceed unity there is no reason why a coeffi- 
cient plus an error should not do so. In such a case the coefficient 
must be taken as 1—this being its most probable value.” 

The only corrected coefficients that would support Spearman’s 
contention that these coefficients are 1 would be those near 1.00 and 
having small probable errors. 


Suppose a corrected correlation 1 +a. If its probable error be 
equal to, or less than, “ then the fallacy of Spearman’s contention 


(that the most probable value of the coefficient is 1.00) is very evident. 
It proves that his hypotheses (lack of correlation between errors) 
are unsound. Suppose, for example, that we have a population of 
3600, and the corrected correlation between Reading Ability and 
Composition is 1.29 + 0.025. Spearman would say the true correla- 
tion was 1.00—without any regard to the probable error value. This 
correlation (1.00) is about as unreasonable a value as could be chosen, 
for the chances that the correlation 1.29 would ever be 1.00 are, in 
the light of its probable error, infinitely remote. 
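The arithmetic behind this illustration can be sketched in Python. Two assumptions are added here and are not statements from the text: that the probable error shrinks as 1/√n when the population grows from 36 to 3600, and that the sampling error is roughly normal, so that one probable error equals 0.6745 standard deviations. The function names are hypothetical.

```python
from math import erfc, sqrt

def distance_in_pe(value, reference, pe):
    """How many probable errors separate value from reference."""
    return abs(value - reference) / pe

def normal_tail(units_of_pe):
    """One-sided chance of a deviation at least this many probable
    errors, assuming normally distributed sampling error."""
    z = 0.6745 * units_of_pe          # 1 P.E. = 0.6745 standard deviations
    return 0.5 * erfc(z / sqrt(2.0))

pe_36 = 0.25                               # tabled P.E. with 36 pupils
pe_3600 = pe_36 * sqrt(36 / 3600)          # assumed 1/sqrt(n) scaling

for label, pe in (("n = 36", pe_36), ("n = 3600", pe_3600)):
    d = distance_in_pe(1.29, 1.00, pe)
    print(label, "P.E. =", round(pe, 3),
          "distance =", round(d, 2), "P.E.; chance =", normal_tail(d))
```

With 36 pupils the coefficient lies only about 1.2 probable errors above unity, a deviation that chance produces readily; with 3600 it lies 11.6 probable errors above, and the chance becomes vanishingly small, which is the point of the illustration.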


1 This formula for the P.E. of a coefficient of correlation corrected for
attenuation was derived by Dr. Truman L. Kelley, but has not hitherto
appeared in print.
Had Spearman known the probable errors of his corrected
coefficients, he would have seen the absurdity of claiming that coefficients
well above 1.00 supported his argument that all corrected coefficients
tended toward 1.00.


GENERAL CONCLUSIONS REGARDING THE TESTS 


1. Certain of the English tests are too unreliable to be worth much 
in the classification of pupils (e.g., Abbott-Trabue Poetry Appreciation 
test). Opposites Beta is more reliable, Briggs’ Form test and Com- 
pletion Beta are slightly less satisfactory and the Thorndike McCall 
still less. None of the English tests has as high reliability as the 
Terman Group test. 

2. As far as raw correlations with English ability are concerned, 
using Teachers’ grades in English as the criterion, none of the correla- 
tions for any of the tests is marked, the highest being 0.49. 

3. The arithmetic test was included in the battery in the second 
study to see whether the criterion of English ability and the treatment 
would result in the Arithmetic test falling into a low position as an 
English test which would be expected on the a priori assumption that 
English and arithmetic are different functions. Since this corrected 
coefficient is so low (0.293) it affords objective evidence that arith- 
metic and English ability constitute two separate capacities. 

4. From the point of view of the classification of pupils, it is
probable that better results can be obtained on the basis of these tests
in the second study (except Terman Arithmetic, Briggs’ Form and
Abbott-Trabue Poetic Appreciation) than can be obtained from Teachers’
Grades. Opposites Beta, a 15-minute test, differentiates more
correctly than Teachers’ grades and could be used very profitably.

5. If compositions are to be used as measuring ability, the average 
score on from 15 to 20 compositions must be taken. 

6. According to our criteria for reading ability, the Terman Group 
test of Mental Ability is a better measure of reading ability than any of 
the other tests used. 

7. Of the so-called Reading tests used, the best of them as a test of 
reading ability is Thorndike’s Visual Vocabulary, while Monroe’s 
Rate of Silent Reading test shows almost no correlation with our 
criteria for reading ability. 

8. From these studies can be seen the necessity for having other
objective information about a test than its reliability coefficient before
the function it measures can be stated.
To these conclusions might be added a warning with regard to the
use of Spearman’s formula—Care must be taken 
(a) that the halves of the tests used in determining the reliability 
coefficients be strictly independent and comparable samplings 
of ability. The tests must be given under similar conditions 
such as will not lead to spurious correlations. 
(b) that the probable errors be determined, and the results be 
interpreted in the light of the probable errors. 
TERMAN VOCABULARY AS A GROUP TEST


ANGELINA L. WEEKS 
Miss Hall’s School for Girls 


The Terman vocabulary test may so easily be used as a group test 
that an attempt has been made to measure the reliability of some 
results obtained in this way. For this purpose, the test was given 
individually and as a group test to the same pupils in two private 
schools; one, a girls’ school of secondary grade, the other, a grade 
school. 

In the secondary school the time limit method was employed, ten 
minutes being allowed for fifty words. Each subject was supplied with 
a pencil, paper, and a type-written copy of the words in one column of 
the list which appears on the last page of the Record Booklet for the 
Stanford Revision of the Binet-Simon tests. The instructions were: 
‘Define very briefly the words in this list. It is not necessary to give 
a full definition like that in a dictionary, but a single meaning is 
sufficient.” 

The lists of words were so distributed that one half of the pupils
used the first column, beginning with “gown,” which is referred to in
this report as “Series A;” the other half used “Series B,” beginning
with “orange.”

After these written group tests were completed, individual tests 
were given, according to Terman’s directions, to the same subjects. 
In each oral test, the subject was given the series of words which she 
did not see in the written test so that no error should arise from a 
possible difference in difficulty of the lists. 

There were fifty-seven girls in the secondary school group examined,
all pupils in Miss Hall’s School for Girls. The ages ranged from thirteen
to seventeen years, with the average age sixteen years. Their ratings in
Otis and Alpha tests indicated that the group was above the average
grade of intelligence.

The accompanying table gives the averages of results obtained in 
both written and oral tests, and also the averages obtained from the 
two word-lists. 

The variation in method seems to affect the average very little more 
than the variation in word-lists. The apparently greater steadiness in 
the written test, which is suggested by a smaller Q, may be due to 
greater accuracy in rating the definitions. Some subjects gave them so 
rapidly that they were not taken down verbatim in oral tests. 
NUMBER OF WORDS CORRECTLY DEFINED
Test          Average          Q


The vocabulary indices obtained in group and individual tests were
converted into vocabulary ages by the norms of Terman, given in “The
Measurement of Intelligence,” and those of Hollingworth, given in
“Vocational Psychology.” The relative changes in age ratings which
accompanied the variation in method of testing are given in the
following table.

VOCABULARY AGES COMPARED

TERMAN NORMS
Variation of written index from oral index in units of age groups

Written index              Same      Lower        Higher      Totals
                            0       1     2      1     2
Number of cases .........   33      18     2      3     1        57
Percentile frequency ....   58      32     3      5     2       100

HOLLINGWORTH NORMS

Written index              Same    Lower    Higher    Totals
                            0        1        1
Number of cases .........   41       13        2         57
Percentile frequency ....   72       23        3        100


According to the Terman norms, thirty-three cases, or fifty-eight 
per cent, are in the same age group in both tests; twenty-one, or 
thirty-seven per cent, are one age group lower or higher in the written 
test than in the oral test. According to the Hollingworth norms, 
forty-one cases, or seventy-two per cent, are in the same age group in 
both tests; fifteen cases, or twenty-six per cent, are one group lower or 
higher in the written test than in the oral test. 
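
The percentages quoted in this paragraph can be checked against the case counts in the table. The following is a minimal sketch; the function name `percent_of_cases` is an assumption introduced here for illustration, and the counts are those given in the text (57 cases in all).

```python
def percent_of_cases(cases, total):
    """Return the share of cases as a whole-number percentage."""
    return round(100 * cases / total)


TOTAL_CASES = 57

# Terman norms: 33 cases in the same age group; 18 + 3 = 21 one group away.
terman_same = percent_of_cases(33, TOTAL_CASES)
terman_one_group = percent_of_cases(18 + 3, TOTAL_CASES)

# Hollingworth norms: 41 cases in the same age group; 13 + 2 = 15 one group away.
hollingworth_same = percent_of_cases(41, TOTAL_CASES)
hollingworth_one_group = percent_of_cases(13 + 2, TOTAL_CASES)

print(terman_same, terman_one_group)              # 58 37
print(hollingworth_same, hollingworth_one_group)  # 72 26
```

The four figures agree with the fifty-eight, thirty-seven, seventy-two and twenty-six per cent stated above.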

With these subjects, whenever the ages indicated by the group test 
and individual test were unlike, the age given by the written test 
tended to approach more nearly to the chronological age. In ninety 
per cent of these cases, the group test was as accurate an age measure 
as the individual test. 

The correlation of the results obtained in group and individual 
tests was +0.7487, which, though not very high, is sufficient to warrant 
the use of the group method of vocabulary testing in secondary 
schools. By this means, extreme cases can be selected early in the 
school year. 

The results of the vocabulary tests were correlated with those of 
other tests given to the same pupils in the same year. In these com- 
putations and all similar ones reported in this article the formula 
based on rank order, as found in Thorndike’s Mental and Social 
Measurements, was used. The following table presents these 
coefficients. 

CORRELATION OF VOCABULARY WITH OTHER TESTS 

                                         | No. of cases | Written test | Oral test 
Oral.................................... |      57      |   +0.7487    | 
                                         |              |   +0.497     |  +0.515 
Hard directions, Woodworth-Wells........ |      51      |   +0.401     |  +0.479 
Completion A, Trabue.................... |      46      |   +0.527     |  +0.551 
School grades: 
  English examination................... |      51      |   +0.721     |  +0.614 
  Oral English.......................... |      51      |   +0.784     |  +0.662 
  French examination.................... |      51      |   +0.325     |  +0.396 
  Oral French........................... |      51      |   +0.434     |  +0.617 
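
A coefficient based on rank order, such as those reported above, can be sketched as follows. The article names Thorndike's text but does not print the formula, so this sketch assumes Spearman's rank-difference form, rho = 1 - 6Σd²/(n(n² - 1)); the function names and the five pairs of scores are hypothetical and are not the data of this study.

```python
def average_ranks(values):
    """Rank values from highest (rank 1) to lowest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def rank_order_correlation(xs, ys):
    """Spearman's rank-difference coefficient (assumed form; exact only without ties)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two cases")
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical written and oral vocabulary indices for five pupils.
written = [42, 38, 35, 30, 27]
oral = [40, 39, 31, 33, 25]
print(rank_order_correlation(written, oral))
```

Only the order of the pupils enters the computation, so two tests scored on different scales can be compared directly.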


In order to study the group method with younger pupils the 
vocabulary test was given to thirty-five elementary school children. This 
group included boys and girls of Miss Mill’s School, whose grade 
distribution was as follows: 


Grade Number of pupils 
3 9 
4 14 
5 2 
6 1 
7 3 
8 6 

With the printed list of words, each child was provided with a large 
sheet of paper on which numbers were written corresponding to the 
numbering of the word-lists. Series B was used in the written test and 
series A, in the oral. This was done to remove all chance for the words 
used in the oral tests to be discussed by the children. The lists are so 
nearly equal in difficulty that it seemed fair to do so. 

The instructions were: “You can see a number before each printed 
word. On your large sheet of paper you find the same numbers. 
When I read the word numbered one, you may write the meaning of 
that word after the number one on your paper. It is not necessary to 
copy the word. If you do not know a word, leave that line blank. Do 
not write anything in that space. One short meaning is enough. 
Spelling does not count in this test. I want to find out how many 
words you know. Ready. Number one. What is an orange?” 

The test was conducted in this way until few or no pupils were 
writing. The examiner then said: “The last words in this column are 
really meant for grown people, but, if you know one of them, it will add 
so much to your credit. I will read the words to you. If you hear one 
that you can write a meaning for, raise your hand and we will wait for 
you to write it.” 

Grades five, six, seven, and eight, who were tested in one group, 
completed the list of fifty words in fourteen minutes. Few defined 
words after the thirty-fourth. In grades three and four, only 
twenty-five words were completed in twenty-four minutes. These 
younger pupils did not refer to the printed lists, but depended upon the 
spoken word. 

During the week following the written tests, each child was given 
an oral individual test, exactly in accordance with Terman’s directions. 
Correct definitions were somewhat more frequent than in the written 
test, but the difference was so small that it would make very little 
change in the pupils’ ratings. In only three cases, and those of pupils 
below the fifth grade, was there a marked disagreement between the 
written and oral tests. 

The chronological age range of this group was from seven to sixteen 
years. The average age was 9.8 years, with a median of nine years. 
The median vocabulary age was ten years. The figures for this group 
are as follows: 


VOCABULARY INDICES 

                                   | Oral test | Written test 
Correlation 
  School standing................. |  +0.236   |   +0.345 


The relative vocabulary age ratings of these younger pupils in the 
two tests are shown by the following table. 


VOCABULARY AGES COMPARED 

TERMAN NORMS 
Variation of written index from oral index in units of age groups. 

Written index          | Same |   Lower   | Totals 
No. of groups          |  0   |  1     2  | 
No. of cases           |  18  |  13    4  |   35 
Percentile frequency   |  51  |  37   12  |  100 

HOLLINGWORTH NORMS 

Written index          | Same |              Lower               | Higher | Totals 
No. of groups          |  0   |  1     2         3            4   |   1    | 
No. of cases           |  14  |  13    3         3            1   |   1    |   35 
Percentile frequency   |  40  |  37    9    [illegible]       3   |   3    |  100 


When the vocabulary ages obtained in the tests were compared 
with the chronological ages, the written test was found to be as 
accurate a measure of chronological age as the oral test in all cases 
except one. This was an eighth grade pupil who was nearly sixteen 
years old. The written index pointed to an age of less than thirteen, 
while the oral index gave an age above fourteen according to the 
Terman scale and above thirteen according to the Hollingworth scale. 
In this case, the written test agreed with the school record. 

Measures made thus far indicate that the group test method of 
giving the vocabulary test is as reliable as the individual method for 
children above the fifth grade. 

NOTES ON ARTICLES IN EDUCATIONAL 
PSYCHOLOGY IN CURRENT ISSUES OF 
OTHER MAGAZINES 


REPORTED BY CECILE COLLOTON 
Department of Educational Psychology, The Lincoln School of Teachers College 


EDUCATIONAL TESTS 


Variation of Marking Systems as Diagnosed by Objective Tests. Riverda H. 
Jordan. Journal of Educational Research, 1921, Oct., 173-179. The distribution 
of school marks in the 6th, 7th and 8th grades in 10 schools of Minneapolis, showing 
the widely divergent marking systems. Comparison of marks with scores on 
intelligence tests. 

Results with Standard Chemistry Tests. B. J. Rivett. School Science and 
Mathematics, 1921, Nov., 720-722. Three chemistry tests devised at 
Northwestern High School, Detroit: (1) symbols of thirty-one important 
elements, (2) valence of twenty important elements and radicals, (3) twenty 
formulas of most common compounds. Norms based on results of tests given in 
all Detroit high schools, Jan., 1921 and June, 1921. 

The Conventional Examination in Chemistry and Physics Versus the New Types 
of Tests. Earl R. Glenn. School Science and Mathematics, 1921, Nov., 746-756. 
A brief discussion of some preliminary tests with instructions on scoring of tests, and 
statistical treatment, graphic representation, and interpretation of test scores. 

Measuring the Progress of Pupils by Means of Standardized Tests. Samuel S. 
Brooks. Journal of Educational Research, 1921, Oct., 161-172. How standard- 
ized tests are used in the rural schools of Winchester, New Hampshire. Repro- 
ductions of individual score cards with scores in graph form showing progress 
through the year. 


INTELLIGENCE TESTS 


School Variation in General Intelligence. Warren W. Coxe. Journal of 
Educational Research, 1921, Oct., 187-194. Data on the general intelligence of 
24 sixth grades in 24 elementary schools in Cincinnati as shown by the Otis Group 
Intelligence Scale. Study of type of community in which each school is located and 
correlation of character of community with intelligence levels of pupils. 

The Relation of Intelligence to Ability in the “Three R’s” in the Case of Retarded 
Children. Maud A. Merrill. The Pedagogical Seminary, 1921, Sept., 249-274. 
An investigation of the relation between the intelligence of a group of retarded 
children and their pedagogical ability as measured by standardized educational 
tests in reading, writing, arithmetic and spelling. 

MISCELLANEOUS 


Some Further Studies of Gifted Children. Elizabeth Cleveland. Journal of 
Educational Research, 1921, Oct., 195-199. Results of studies of three “special 
advanced” classes compared with a control group of normal pupils in the same 
schools. Studies include health, nationality, home conditions, types of reading 
and recreation, amount of travel, vocational and educational plans. 

Motivated Drill Work in Third Grade Arithmetic and Silent Reading. J. H. 
Hoover. Journal of Educational Research, 1921, Oct., 200-211. Description of 
certain games and devices utilizing the play instinct in drill work. Results of an 
experiment with these materials in thirty different third grade rooms, including 
571 children in non-drill and 568 in drill sections. Improvements in drill section 
much more pronounced than in non-drill section. 

Comparative Social Traits of Various Races. Charles B. Davenport. School 
and Society, 1921, Oct. 22, 344-348. A study of racial differences in social 
traits. Investigation conducted at Washington Irving High School with 51 
girls representing ten races. 

Who Can Be Educated? Willard W. Beatty. School and Society, 1921, Oct., 
311-313. A discussion of the need for a laboratory study of children—say, 60 
to 100 children of all races for the period from birth to maturity. 

Some Elementary Statistical Considerations in Educational Measurements. 
J. Crosby Chapman. Journal of Educational Research, 1921, Oct., 212-220. 
A critique of current methods of obtaining norms of achievement for educational 
tests; ultra-refined measurement on one side and ignored errors of selection, ad- 
ministration, etc. on the other. 

Apperceptive Abilities. Augusta F. Bronner. Psychological Review, 1921, 
July, 270-279. The discussion of apperception as a mental process. How present 
tests estimate this ability. 

NEW PUBLICATIONS IN EDUCATIONAL 
PSYCHOLOGY AND RELATED FIELDS OF 
EDUCATION 


1. Three Books which Deal with Mental Pathology.—Education 
has for one of its chief aims that modification of instinctive action, 
which in current phraseology is called “socialization.” Educators 
strive to make man over, from what he is by original nature, into 
what his civilized contemporaries wish him to be. Educational 
psychology must therefore be concerned with the question: What 
happens within the organism when an instinctive tendency conflicts 
irreconcilably with another instinctive tendency, with an idea, with a 
habit, or with a circumstance? 

The three volumes here considered try to answer this question. 
Within the scope of a brief review it is impossible to do full justice 
to the discussions, which cover hundreds of pages, but the more 
interesting points may be outlined, taking the books in what seems to 
the reviewer to be their order of importance for the study of human 
behavior. 

Rivers’ volume¹ originated in his observations upon the 
psychoneuroses among soldiers in the great war. It is pointed out that in 
soldiers the danger-avoiding impulses, normally active, are brought 
into sharp conflict with ideas of patriotic duty, habits of military 
drill, and the circumstance of being impressed into military service. 
In this conflict the danger-avoiding impulses are inhibited from 
biologically appropriate motor expression, but they do not thereby 
die out from disuse. The direct motor response, being suppressed, is 
transformed into whatever indirect response will allay the impulses. 
Thus develop hysterical blindness, deafness, paralysis, and other 
functional disorders, which enable a soldier to be safe, and at the 
same time patriotic, obedient and dutiful. 

¹ Rivers, W. H. R.: “Instinct and the Unconscious.” Cambridge University 
Press, 1920, pp. 247. 

Rivers differs from the other authors to be considered in this 
review in finding the primary source of hysterical behavior not in 
sex, but in thwarted “danger-instincts.” He shows how the facts of 
civil as well as of military life are to be harmonized with this view. 
For instance, in civil life women so preponderate among hysterics 
that the disorder takes its name from the Greek word which means 
uterus. But men develop hysterical symptoms readily enough in 
time of war. In civil life women are constantly exposed to the dangers 
of child-bearing, which are analogous to the sufferings and perils of 
war, but which men never have to face. When men face war, they 
face danger, as women face danger all the time; and then hysteria 
appears among men, also. 

In general Rivers finds Freud’s doctrines confirmed as regards 
the great importance of instincts, the unconscious conflict and repres- 
sion of instincts, and the mental mechanisms resulting therefrom. 

Kempf,¹ also, finds that instinctive tendencies do not disappear 
through training, but, when appropriate and direct motor response is 
made impossible, eventuate in abnormal behavior. Kempf refers the 
disorders chiefly to the instinct of sex. Particularly does he lay stress 
upon the theory of sexual attachment to a parent, either of the same 
sex or of the opposite sex. As Rivers is able to reconcile the phenomena 
of mental pathology with violated danger-impulses, so Kempf is 
able to trace them to sexual impulses. Everywhere Kempf asks 
whether it is not possible to see a sexual symbol in object, act or word, 
and finds the answer to be affirmative. Let us rather put the question 
thus: Is it possible to find an object, act or word which cannot be 
interpreted as a sexual symbol by one “set” for that percept? 

Kempf cites case after case of abnormal behavior, seen chiefly 
at St. Elizabeth’s Hospital, where recovery followed psycho-analysis, 
conducted from his point of view. As has often been said, however, 
this is no proof of the correctness of the hypothesis, as the symptoms in 
these cases also disappear without psycho-analysis (the characteristic 
recovery of the manic-depressive); that the liability to recurrence 
is less after psycho-analysis is not established by anything 
Kempf presents. 

¹ Kempf, E. J.: “Psychopathology.” C. V. Mosby Co., St. Louis, 1920, pp. 
762. 

Although the author’s particular stand-point remains thus in 
question, it is true that he performs a service in emphasizing again 
the importance for education of the autonomic nervous system, too 
frequently neglected in educational psychology. In the training of 
teachers, the central nervous system is stressed, because learning 
subject-matter has chiefly to do with cortical neurones. It is clear 
that if there are neurone-patterns which cannot be modified by 
instruction, the fact is fully as important as that there are patterns which 
can be so modified. If the “sets” of the organism, which originate 
in the autonomic system, cannot be changed, but can only be 
suppressed in action by training, then it is essential to know what account 
should be taken of them in education. 

It is doubtful whether the third of these books¹ may properly 
claim space in a scientific periodical. The author’s style and intention 
appear to be journalistic rather than scientific. It is not clear 
what training the author has had which might qualify him to 
undertake a treatise on the subject considered. It is fairly certain 
that he has not studied psychology systematically, for otherwise 
he would scarcely attribute to “academic psychologists” the remarks 
and doctrines which he does attribute to them (assuming that he 
means by “academic psychologists” teachers of psychology in 
universities and colleges). Inspection of the recommended bibliography 
confirms this impression. It is improbable that he has devoted 
himself to biology, for he believes that heredity is a matter of small 
importance in the study of behavior, thus aligning himself with 
naive majority opinion, as opposed to expert opinion. “Insanity, 
feeblemindedness or criminality are not inherited characters. They 
are often acquired through either imitation or suggestion, or both” 
(p. 120). “Most of our heredity is pseudo-heredity which, being 
simply the shaping influence of our environment, can be defeated as 
soon as we realize that it is not working for our welfare” (p. 126). 
These sentences may serve to indicate the author’s background in 
biology. 

One who appears to be untutored in psychology and in biology 
will scarcely command respect as an exponent of scientific thought 
about human conduct. There seems to the present reviewer to be no 
reason for commenting further upon this book. 

¹ Tridon, A.: “Psycho-Analysis and Behavior.” Knopf, New York, 1920, 
pp. 354. 

As the study of human nature progresses, it becomes more 
certain that educators cannot “eradicate” instincts, as formerly it was 
thought that they might. It becomes more and more questionable as 
to whether education can even modify the inborn impulses of man, 
much less eradicate them. Education does succeed in modifying 
motor response; but what is the expense account attached to the 
modification? Is all the inhibition of instinctive action, which is 
secured by education, pure gain? Or is there a heavy wastage of 
abnormal conduct, eventuating as a compromise between the 
persistent original impulse and the idea or habit that has been taught? 
If so, is this waste avoidable through improved methods of education? 
Books like those of Rivers and Kempf have value for educators, 
because they stimulate thought about these questions, though from 
their disagreement on fundamental issues they make clear that final 
answers cannot yet be given. 

It is remarkable that in a constant perusal of the psycho-analytic 
literature now accumulated and current, reference is never seen to the 
laws of learning, which have been established by the laboratory study 
of animals. The volumes here reviewed make no reference to this 
experimental work; yet to the educational psychologist the laws of 
animal learning seem adequate to include the phenomena of abnormal 
human behavior. Thorndike discovered how a hungry monkey 
learns to rush to the top of his cage, when food is placed at the bottom; 
how to teach a kitten to scratch itself immediately, when restrained 
behind confining bars. Have these discoveries no meaning for those 
who write about psycho-analysis? Or do they never come in contact 
with the literature of experimental psychology? In the case of 
Rivers, at least, one feels constrained to assume the first of these 
alternatives. 

LETA S. HOLLINGWORTH. 